Data Cleansing: Save $$$ by Getting it Right
by Andrew LaDuke


Data Warehousing (DW) is an exciting concept in the technology arena. It is solving a growing problem: business managers today have access to information from multiple departments. As companies merge with and acquire other companies, they end up with information systems that vary from department to department. Some use Oracle, others use Siebel, still others Microsoft. It's a constantly growing problem. DW can help.

A DW extracts data from a business's many different sources: payroll, sales, manufacturing, logistics, etc. It then stores the data in a central database, one that is structured slightly different from your typical database. Next, as the data comes in, it gets transformed. Finally, "data marts" are set up where users can access the data with just a web browser. They can drop and drag items to create business reports any way they like. These reports can be scheduled to be updated daily, weekly, or monthly. It's a manager's dream.

Before the extracted data, like all sales reports, goes into the DW, it goes through a process called ETL. Extract, transform, & load (ETL) refers to three separate functions combined into a single program. First, the extract function reads data from one of these many different source databases and pulls out the data that you want. Next, the transform function works with this data - using parameters that you set up. Finally, the load function is used to write the resulting data to the data warehouse. Then users can access the data in a user friendly way through the "data marts."

So far so good, right? Wrong. Everyone enters data differently. When a customer calls in and gives their name, the data entry person may input "John Smith" or "Jon Smyth." Data entry is very subjective. How do we address this problem? It's handled by your IT guy using a process called data cleansing! Or is it data scrubbing? Let's take a look.

Garbage in, garbage out!

Managers need accurate data about their customers and business partners. If we build our DW without data cleansing, or standardizing the data, inconsistencies and inaccuracies will distort the data. This will corrupt the data views that the managers are using to make business decisions. These data inconsistencies will be very difficult if not impossible to trace or correct. So it's important to get this step right. It's kind of like painting. If you do the proper prep work, even a lousy paint job will look great!

Do I have to clean it all?

This is what my seven year old says when he turns my basement into a federal disaster relief zone. Very likely not all of the data needs to be scrubbed. Experts agree that only about 20% of the data may meet 80% of the business information requirements. Clean only the data that has real business value. This is good to know, because if you are AT&T or some other huge corporation with six gazillion customers, you definitely don't want to clean all of the data. There are also software tools that can help automate this process. They are not cheap. There are quite a few vendors out there who can provide an application that will meet your needs.

Is it worth it?

What's my ROI?, your boss asks you. A valid question. Let's consider some examples. A leading insurance company discovered a shocking statistic from their new DW. 80% of all their claims involved broken legs. Broken legs? Did everyone in Vail get the same insurance carrier? No. When they investigated, they discovered that the default code in their data entry system was the one for a broken leg. Since the data entry clerks were paid by how fast they entered data, they just went with the default code to move things along.

CRM, customer relationship management, is heavily dependent on accurate data. CRM sometimes uses the data collected for things such as a direct mail campaign. Ever get two or three pieces of junk mail from the same company on the same day? It's very annoying, unless, of course, it's from Victoria's Secret. So data cleansing will help make sure addresses and other types of information are accurate and up to date.

Ameritech used data cleansing to validate millions of customer names and addresses. The project cut the volume of mail they sent out by 4%, saving them about $250,000 each year. The project took nine months and cost between $210,000 and $225,000 to implement. The savings after the first year allowed Ameritech to show a return on its investment very quickly.

Another example is a company that provides products and training for auto improvements such as painting and collision repair. They would purchase a database of prospective customers, such as collision and auto body repair shops. Then they would do a regular mailing of a marketing package that included a catalog and a video. Nearly half of all the packages came back marked "undeliverable." With each package cost amounting to over $10.00, that became expensive. So they turned to an internet based tool that would validate the addresses and provide additional information about each prospect, including annual sales, number of employees, and key executives. They paid about $300 to clean up 14,000 records, then paid $20.00 a month for a subscription. The effect was dramatic: The rate of undeliverable mail dropped from nearly 50% to around 14%, saving the company a large chunk of money.


Businesses today want to know as much as possible about their customers. They are implementing ERP, SCM, & CRM programs to manage their data efficiently. These systems are optimized when the data they are using is accurate. We saw from the Ameritech example that data cleansing can produce a positive ROI in a short period of time.

Are there alternatives? One might be to apply data cleansing techniques to prevent defective data from being input into the original database. So rather than cleaning up the data, you are creating strict data format guidelines for data entry. But we're only human, and humans make mistakes.

Certainly one area to benefit from data cleansing is the rapidly growing E-commerce market. RosettaNet is a consortium of 400 of the world's leading IT companies. They are developing standards for E-commerce. Data cleansing can help ensure that when companies are doing business through XML enabled web services, they can read and communicate data effectively over the internet. As time goes on, keep abreast of data warehousing, I think you'll find its time has come.

Andrew LaDuke is a student at Walsh College, Troy, Michigan, where he will complete a Master's degree in Business Information Technology. He can be contacted by email at:

Many more articles in Information Technology in The CIO Refresher and eBusiness in The CEO Refresher Archives


Copyright 2002 by Andrew LaDuke. All rights reserved.

Current Issue - Archives - CEO Links - News - Conferences - Recommended Reading