

Auditing Logs

The easiest sources to extract data from are those that have an auditing function built into them. The data warehouse extraction team needs to understand these auditing mechanisms and develop applications that work with them. Data warehouses built on Oracle databases often pull changed data from the redo logs. The redo logs are a convenient place to find data that has changed, and reading them avoids changing the application code or interfering with the performance of the operational database. In essence, a data warehouse can pull information from an Oracle operational database without the database or the application being affected in any way. Most modern databases maintain these types of logs, which allow the database to recover if an error occurs during processing. One of the tricks of using this extraction approach is determining the format of the log files, which are usually stored in an internal binary format. Software packages designed to extract data from logs may already exist for some databases; Platinum and BMC, for example, have developed packages that can read Oracle redo logs, Microsoft SQL Server transaction logs, and IBM DB2 logs. Older or less common technologies may not have commercial software available that can read their logs. If that is the case, the extraction team must either develop a custom utility or find another method for refreshing the data warehouse.
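
As an illustration, the sketch below shows how an extraction job might read changed data from Oracle redo logs using Oracle's LogMiner package (DBMS_LOGMNR) through the python-oracledb driver. The connection details, log file path, and schema name are placeholders, and LogMiner is only one way to read the logs; the commercial packages mentioned above expose their own interfaces.

  import oracledb  # python-oracledb driver; connection details below are placeholders

  conn = oracledb.connect(user="dw_extract", password="secret", dsn="prod_db")
  cur = conn.cursor()

  # Register a redo log file and start a LogMiner session using the online catalog.
  cur.execute("""
      BEGIN
        DBMS_LOGMNR.ADD_LOGFILE(LOGFILENAME => :f, OPTIONS => DBMS_LOGMNR.NEW);
        DBMS_LOGMNR.START_LOGMNR(OPTIONS => DBMS_LOGMNR.DICT_FROM_ONLINE_CATALOG);
      END;""", f="/u01/oradata/redo01.log")

  # Pull the reconstructed SQL for every change made to the schema of interest.
  cur.execute("""
      SELECT seg_owner, seg_name, operation, sql_redo
        FROM v$logmnr_contents
       WHERE seg_owner = :owner""", owner="SALES")
  for owner, table_name, operation, sql_redo in cur:
      print(operation, table_name, sql_redo)   # feed these into the warehouse load instead

  cur.execute("BEGIN DBMS_LOGMNR.END_LOGMNR; END;")
  conn.close()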

Full Refreshes with Data Selection Performed Inside the Data Warehouse

Another way to refresh the data warehouse is to pull all of the data from the operating environment to the data warehouse on a nightly basis. When the data warehouse receives the data, it must determine which data to keep. This solution has many problems.

First, the amount of data transferred to the data warehouse can be enormous. For example, a company may have one terabyte of important data in its operational systems. Every night, one terabyte of data must be pulled from the operational databases, across the network, and into the data warehouse. After the data warehouse receives the terabyte, it must perform heavy processing that discards approximately 95% of the data. This process could take days to complete. Since most data warehouses have an 8-hour refresh window, this simply is not possible; that much data cannot be moved through the entire process in such a short amount of time, as the rough calculation below shows.
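
A back-of-the-envelope check makes the problem concrete. The figures below use the one-terabyte example above and the eight-hour window, and they cover data movement only, before any of the processing that discards 95% of the rows.

  # Sustained throughput needed just to move 1 TB through an 8-hour window.
  terabyte_mb = 1_000_000           # 1 TB expressed in megabytes (decimal units)
  window_seconds = 8 * 60 * 60      # the 8-hour refresh window

  required_rate = terabyte_mb / window_seconds
  print(f"{required_rate:.1f} MB/s sustained")   # about 34.7 MB/s for data movement alone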

A second problem with pulling all of the data from the operating database is that the operating database will slow down considerably during the refresh process. The users of the operational database will notice an enormous degradation of query response time. A data warehouse team that imposes these types of problems on the operational team will not be very popular.

Code Changes in the Existing Applications

The third way to refresh the data warehouse is to modify the operational applications so that they report the data that has been inserted, updated, or deleted. The same effect can be achieved at the database level through triggers, snapshots, and possibly advanced replication. This is usually the worst solution: the data warehouse team must write code for the data warehouse, as well as rewrite and debug application code that may have existed for years.
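
The sketch below illustrates the database-level variant using triggers. It uses SQLite purely so the example is self-contained; the table, column, and trigger names are illustrative, and a production system would use the operational database's own trigger facility.

  import sqlite3

  # Change capture at the database level: triggers copy every change into a
  # separate table that the nightly extract can read.
  conn = sqlite3.connect(":memory:")
  conn.executescript("""
      CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL);
      CREATE TABLE orders_changes (
          order_id    INTEGER,
          amount      REAL,
          change_type TEXT,
          changed_at  TEXT DEFAULT CURRENT_TIMESTAMP);

      -- Record every insert and update so the extract reads only this table.
      CREATE TRIGGER orders_ins AFTER INSERT ON orders BEGIN
          INSERT INTO orders_changes (order_id, amount, change_type)
          VALUES (NEW.order_id, NEW.amount, 'I');
      END;
      CREATE TRIGGER orders_upd AFTER UPDATE ON orders BEGIN
          INSERT INTO orders_changes (order_id, amount, change_type)
          VALUES (NEW.order_id, NEW.amount, 'U');
      END;
  """)

  conn.execute("INSERT INTO orders (order_id, amount) VALUES (1, 99.50)")
  conn.execute("UPDATE orders SET amount = 120.00 WHERE order_id = 1")
  print(conn.execute("SELECT * FROM orders_changes").fetchall())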

Determining the changed data can be quite a bit of work. After the data has been identified, the team must determine what must be done to the data to prepare it for the data warehouse tables. There are several types of scrubbing actions the data warehouse team must identify.

The first action is to change the operational data to adhere to standards determined in the design phase. These standards must be applied to all operational data sources. For example, one application may store the date as mm/dd/yy in a character field, while the Oracle database stores the date as a 7-byte internal code that records the date down to the second. An application that stores the date in the mm/dd/yy format may require extra logic in the extraction process. In a human resources application, it is important to recognize that the century has not been recorded: a year of 40 may indicate 1940, when an employee was born, or it may indicate 2040, when the employee may begin withdrawing from his or her 401(k) retirement account.
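
A minimal sketch of such a conversion follows. The pivot year used to resolve the missing century is an assumption that would have to come from knowledge of the application; a birth-date field and a retirement-date field may well need different rules.

  # Date standardization for a source that stores mm/dd/yy in a character field.
  # The pivot year (50) is an assumed business rule, not taken from the text.
  def to_iso_date(mmddyy: str, pivot: int = 50) -> str:
      """Convert an mm/dd/yy string into an unambiguous yyyy-mm-dd string."""
      mm, dd, yy = mmddyy.split("/")
      year = int(yy)
      century = 2000 if year < pivot else 1900   # resolve the missing century
      return f"{century + year:04d}-{int(mm):02d}-{int(dd):02d}"

  print(to_iso_date("07/04/40"))   # -> 1940-07-04, read as a birth date
  print(to_iso_date("01/15/07"))   # -> 2007-01-15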

A second example of standards is how sex is recorded. One application may record sex as “male” and “female,” while another records it as “m” and “f.” This type of conversion is simple to code but time-consuming because of the large number of fields, spread across different applications, that must be converted.
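
A table-driven mapping handles this kind of conversion; the standard codes chosen below are an assumption for illustration.

  # Mapping table for the sex field.  The target codes 'M'/'F' are assumed
  # standards from the design phase, not taken from the source text.
  SEX_CODES = {"male": "M", "female": "F", "m": "M", "f": "F"}

  def standardize_sex(value: str) -> str:
      """Normalize whatever a source application recorded to the warehouse standard."""
      return SEX_CODES[value.strip().lower()]

  print(standardize_sex("Female"))   # -> F
  print(standardize_sex("m"))        # -> M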

A second action the extraction scrubbing team must address is data merging. Data merging occurs when information that needs to be stored in a single table or column in the data warehouse is actually pulled from several tables or columns in the operational environment. For example, a company with an order database on a UNIX machine and a shipping database on an NT platform may need to pull information from both databases to create a complete record of each sale. The extraction process must pull data from each environment and merge it according to a series of rules, which can be complex.
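
The sketch below shows the shape of such a merge with two in-memory sources keyed on an order number; the key, field names, and the conflict rule are assumptions for illustration.

  # Merging an order record and a shipping record into one sales row.
  orders = {1001: {"customer": "ACME", "amount": 250.00}}          # UNIX order database
  shipping = {1001: {"shipped": "1999-03-02", "carrier": "UPS"}}   # NT shipping database

  def merge_sale(order_key):
      record = dict(orders.get(order_key, {}))
      record.update(shipping.get(order_key, {}))   # merge rule: shipping values take precedence
      record["order_id"] = order_key
      return record

  print(merge_sale(1001))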

The extraction team is also responsible for populating summary tables. These summary tables are stored inside the data warehouse and are derived from other data in the warehouse (see Figure 30.4). The data warehouse team may decide to populate the summary tables at a different time than the extraction, but this approach results in a large increase in disk I/O because the base data must be read a second time. A better solution is to have the extraction processes build the summary tables as they build the base tables, as sketched after Figure 30.4.


Figure 30.4.  Extraction process that populates summary tables.
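
The following sketch shows the idea of maintaining a summary table in the same pass that loads the base table, so the detail rows are read only once; the table and column names are illustrative.

  from collections import defaultdict

  base_rows = []                              # the detail (base) table
  daily_sales_summary = defaultdict(float)    # summary keyed by sale date

  def load_row(row):
      # Load the detail row and update the summary in the same pass, avoiding
      # a second read of the base table later.
      base_rows.append(row)
      daily_sales_summary[row["sale_date"]] += row["amount"]

  for row in ({"sale_date": "1999-03-01", "amount": 100.0},
              {"sale_date": "1999-03-01", "amount": 40.0},
              {"sale_date": "1999-03-02", "amount": 75.0}):
      load_row(row)

  print(dict(daily_sales_summary))   # {'1999-03-01': 140.0, '1999-03-02': 75.0}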

The extraction team must also update metadata in the following areas:

  Descriptions about the data
  Refresh schedule
  Size of the data
  Groups permitted to use this table
  Original location of the data

Maintenance of this metadata should be automated and performed by the extraction process itself.
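
A simple sketch of the metadata record an extraction job might write after each run appears below; the fields mirror the list above, and the structure itself is an assumption.

  import datetime

  # One way the extraction job could record its metadata automatically.
  def record_metadata(table_name, description, row_count, size_mb, groups, source):
      return {
          "table": table_name,
          "description": description,                              # description of the data
          "last_refresh": datetime.datetime.now().isoformat(),     # refresh schedule/history
          "row_count": row_count,                                  # size of the data
          "size_mb": size_mb,
          "permitted_groups": groups,                              # groups permitted to use the table
          "source_location": source,                               # original location of the data
      }

  meta = record_metadata("sales_fact", "Completed sales orders", 1_250_000, 840,
                         ["marketing", "finance"], "orders@unix_prod; shipping@nt_prod")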

