

The extraction team must keep several things in mind while developing extraction and cleansing processes.

  The processes must be fast. The team must estimate the volume of data these routines may have to process in the future. The code must be fast enough to accomplish its work within the 8-hour refresh period.
  The operational technologies will continue to change. An example is the upgrade of an operational environment from Oracle7 to Oracle8, in which Oracle Corporation changed the format of the redo logs. An extraction process based on the Oracle7 redo logs may have to be rewritten. It is also important that the data extraction team understand the future direction of the operational environments. It may be beneficial to choose an extraction tool that keeps up with the underlying technology. For example, a data warehouse pulling data from an Oracle database can use extraction software from a vendor such as Platinum or BMC. As the format changes from Oracle7 to Oracle8, the data warehouse team need only upgrade its Platinum or BMC software to continue processing the new data. These vendors maintain a close relationship with Oracle to track and prepare for coming changes; such a relationship with a large software company like Oracle would be difficult for an individual data warehouse team to establish.
  The extraction team must keep in constant contact with the operational teams. Operational data changes often. Tables are moved, renamed, and deleted. It is not uncommon for the extraction team to bring over erroneous or incomplete data due to unknown operational changes. This can result in passing on inaccurate information to the users who make decisions based on this data. False information has the potential of sending a company in the wrong direction. Thus the relationship between the extraction and operational teams is paramount to the validity of the data warehouse.
  The extraction processes should be robust. They must be able to recover from a failure without starting over. The data set can be so large that it may not be possible to kick off the entire process again.
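The last point above, restarting from the point of failure rather than from the beginning, is usually handled with some form of checkpointing. The following is a minimal sketch of the idea, not a production design; the checkpoint file name, its JSON format, and the batch structure are assumptions made for this example.

```python
import json
import os

CHECKPOINT = "extract.ckpt"  # hypothetical checkpoint file for this sketch

def load_checkpoint():
    """Return the index of the last batch loaded cleanly, or -1 on a fresh run."""
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return json.load(f)["last_batch"]
    return -1

def save_checkpoint(batch_id):
    """Record progress only after a batch has loaded successfully."""
    with open(CHECKPOINT, "w") as f:
        json.dump({"last_batch": batch_id}, f)

def extract(batches, loader):
    """Feed each batch of rows to `loader`, skipping work finished before a failure."""
    done = load_checkpoint()
    for batch_id, rows in enumerate(batches):
        if batch_id <= done:
            continue  # this batch was loaded before the crash; do not redo it
        loader(rows)
        save_checkpoint(batch_id)
```

If the process dies halfway through, rerunning it resumes at the first unfinished batch, so a multi-hour load of an enormous data set does not have to start over.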

The extraction processes must also verify the data before it is released to the users. This verification process performs several basic functions. One of these functions can apply statistical tests to judge whether new data is plausible.

For example, a company may have always had between one and two million dollars in sales per day. If new data from the operational environments shows sales of only one hundred thousand dollars per day, the data must be verified by a data analyst before it is released to the user community. This type of verification performs three functions.
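The range check described above can be sketched with a simple statistical test. The three-standard-deviation threshold and the sample history below are illustrative assumptions, not figures from the text.

```python
import statistics

def suspicious(history, new_total, threshold=3.0):
    """Flag a new daily total that lies more than `threshold` standard
    deviations from the historical mean of daily sales."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    return abs(new_total - mean) > threshold * stdev

# Daily sales have always run between one and two million dollars.
history = [1_400_000, 1_600_000, 1_500_000, 1_800_000, 1_300_000]

suspicious(history, 100_000)    # a $100,000 day is flagged for an analyst
suspicious(history, 1_550_000)  # a typical day passes quietly
```

A flagged total is not automatically rejected; it is simply held back until a data analyst confirms whether the anomaly lies in the extraction process, the operational system, or the business itself.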

First, it checks the validity of the data warehousing extraction processes. In this case, the extraction process may have performed an improper math function as it pulled the data into the warehouse.

Second, it checks the validity of the operational processes. For this anomaly, the operational processes may have lost the records of some sales.

Third, the verification process may alert management to business problems. If total sales have actually fallen by 90 percent, management will want this information as quickly as possible.

The verification process can be worth its weight in gold. Because almost all operational systems will tie into the data warehouse, the verification process can act as a company-wide alarm system.


Note:  
A Fortune 500 company had a problem in its accounts receivable system. It had underbilled customers, for a total loss of over 30 million dollars. The data warehouse verification process detected the error and reported it to a data warehouse analyst. The money that was recovered was enough to pay for the entire data warehouse.

What About Data Marts?

Data marts are defined in several ways throughout the industry; the term has never settled into a single definition. This explains the confusion that arises whenever the data mart topic is brought up.

Most of the definitions can be classified into two major categories (see Figure 30.5).


Figure 30.5.  Two definitions of data marts.

  The first defines the data mart as a piece of the data warehouse, used to build the warehouse in small steps. After all the data marts have been created, they are merged into one data warehouse. Under this definition, data marts pull data directly from the operational environments.
  The second defines the data mart as a subset of data pulled from the data warehouse. The data flows from the operational environments into the data warehouse; each data mart then chooses what data it wants from the warehouse. That data is pulled into the data mart and massaged into a more usable format.
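Under the second definition, the flow from warehouse to mart can be sketched in a few lines of SQL. The table and column names below are invented for illustration; an in-memory SQLite database stands in for both systems.

```python
import sqlite3

# An in-memory database stands in for the warehouse and the mart.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE warehouse_sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO warehouse_sales VALUES (?, ?)",
                 [("East", 500.0), ("West", 750.0), ("East", 250.0)])

# The mart pulls only the subset it cares about, pre-aggregated ("massaged")
# into a shape its users can query directly.
conn.execute("""CREATE TABLE mart_east_sales AS
                SELECT region, SUM(amount) AS total
                FROM warehouse_sales
                WHERE region = 'East'
                GROUP BY region""")

rows = conn.execute("SELECT region, total FROM mart_east_sales").fetchall()
```

The mart ends up with a single summarized East-region row rather than the warehouse's full detail, which is exactly the subset-and-reshape pattern the second definition describes.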

Data marts are a fairly recent addition to the data warehousing movement. They have become increasingly useful for several reasons.

First, they can be developed in a short period of time (usually 3 to 6 months per data mart). Due to the rapid development pace, they may begin paying for themselves sooner than a data warehouse.

Second, designers learn to build better data marts as they receive feedback from existing ones. A team is usually far more sophisticated in data mart design after building its first data mart.

Another reason for the popularity of data marts is the ability to pick a small set of data for the first iteration of the data mart process. Several factors are considered.

  What data is the easiest to extract from the operating environments?
  What data will be the most utilized by management?
  What data mart best resembles the rest of the data warehouse project?

After management has seen the benefits of a data mart, it is easier to receive funding for additional data marts or an entire data warehouse.

Another benefit of data marts is cost. A data mart usually costs several hundred thousand dollars to implement. This relatively small amount of money is easier for management to approve than a multi-million-dollar data warehouse. Because financing is often done at the departmental level, approval for a data mart comes more easily: the department understands how the data mart will be used. Financing is much harder to find at the corporate-wide level.

