SQL Server Business Intelligence Staging Data
By: Siddharth Mehta
In the last chapter we developed the first layer of solution development - OLTP database, which would act as the source system for our analytics requirement. As per our architecture, the next part of the solution is extracting the data from source system and accumulating the data in the staging area. Though we do not necessarily need to implement a staging area for our exercise, we will discuss some key considerations for staging data and the data extraction strategy. It's good to have knowledge of this process as most real-world projects implement a staging area as part of the solution.
In the context of our scenario, we would need to read data from the AdventureWorks OLTP and create a database for the staging area. The schema of the staging area can be identical to that of the source system. To limit the complexity and understand the concepts easily, consider that we have a cloned schema of AdventureWorks called AdventureWorks_Staging, and the intention is to read data from OLTP database and load the changed data into Staging database.
- Delta detection - The first point to check is the system's ability to detect data changes. In the AdventureWorks database tables, most of the tables have auditing fields. These fields are updated whenever there is any change in data. So by using a counter that records the last cut-off time, one can check all the records beyond the cut-off date and consider them as added / updated records.
- Data archival - Records in the OLTP system should always be logically marked as deleted and should not be physically deleted from the tables. If the records are physically deleted, the delta detection technique by using auditing fields will not work, as the deleted records would never get detected. The only option to deal with this limitation of source system would be to setup a copy of the database in the staging area and do a full compare of the data in the staging area with that of source system during data extraction.
- Data volume - Volume of data would provide an estimate of the expected load on the network as well as the ETL server. A huge volume of data would mean that the network should support transporting a large volume of data at high speeds for optimal performance. Provided that the network is able to keep up with the performance requirements and transport data from source system, the ETL server would have to be provisioned with sufficient memory, storage and processing resources to read and write large volumes of data on the staging server.
- Data refresh frequency - Frequency of data processing from source system to the staging area would determine the performance requirements of the ETL to a considerable extent. For example, if the transfer of data from source system to the staging area takes 2 hours for 1 TB of data, and the data is to be refreshed every 1 hour, then the processing window of 2 hours won't be acceptable as before the first cycles completes the next cycle would already start.
- Master data load - ETL processes may be composed of a set of ETL packages that extract data from multiple tables. These packages may run in parallel as well as in sequence depending upon data. For example, we cannot load data in Products table without loading the changes in the Categories and Sub-categories table. If we do so, we may come across scenarios where we find records in the Products table which has a missing sub-category in the parent table as we loaded the Products table prior to Sub-categories table. So master data dependency has to be detected and loaded prior to detail level data.
- Operations control - ETL packages run in a batch where a set of packages are executed in parallel and in sequence. Any ETL package may fail and this can cause data discrepancy if not handled in an managed fashion. So the ETL process should be carefully audited and gracefully handled for exception management at each stage of ETL process.
SQL Server Integration Services (SSIS) is the technology from Microsoft Business Intelligence technology stack that is used to developing ETL solutions. Consider reading the SSIS tutorial to learn and develop your skills.
After the data is staged in the staging area, the same is validated for data quality
and cleansed accordingly.
Data Quality Services
is the technology from Microsoft BI stack
for this purpose. Projects that may want to validate data and/or transform data
against business rules may also create another data repository called a Landing Zone.
The validates and/or transformed data is read by the ETL from the staging area and
then loaded into the landing zone.
- ETL is used in multiple parts of the BI solution, and integration is arguably the most frequently used solution area of a BI solution. Consider creating ETL packages using SSIS just to read data from AdventureWorks OLTP database and write the same into Staging area in the shortest possible time. This would help you to develop your grip on ETL implementation using SSIS.