When a new ETL development kick-off takes place, there are a lot of considerations to keep in view for planning the low-level design. Using SQL Server Integration Services (SSIS), an ETL solution is implemented in the form of packages. ETL solutions can be for a variety of projects ranging from Data Migration and Cleansing to a Data Warehousing Load and Refresh.
From an implementation perspective, SSIS packages can be categorized as follows based on the functionality:
Transform and Load Packages
Maintenance / Housekeeping Packages
Cube Refresh Packages
In this series of tips, we will discuss points to keep in view while planning the low-level design of any ETL solution for different flavors of package development. In this tip we start with the discussion of points to keep in view for the implementation of Extract Packages using SSIS.
The following is a list of things to consider before beginning a new ETL project or if you want to modify the process you already have in place.
1 - How Many Developers
Generally extract packages are fairly simple in design and the main motive is to extract the delta (the set of records that have been modified (inserted / deleted (logically) / updated) since the previous extract), unlike load and transform packages which can be different on a case by case basis.
A single developer can suffice to develop extract packages, or in a typical large data warehousing implementation, one developer per logical area of extract would be an advisable distribution of work.
Parallel development by multiple developers on a single package is tricky, having one developer per logical area of extract makes the task much easier. When there is the need to have multiple developers work on the same package you should consider implementing version control.
2 - Number of Extract Packages
The first question that comes to one's mind when kicking off development is to develop a single package or multiple packages for the extract. It can be one package that can be generic enough to read the delta from the transaction system or it can be multiple packages divided in terms of logical business entities that reads the delta from corresponding tables. By dividing a large number of table extracts into multiple packages reduces the complexity of the package and if any issue occurs during the extract, it is limited to a logical area which is more desirable and easier to troubleshoot.
For example, if the source feeds are coming in the form of an Excel or flat files having the same structure, it would be easy enough to read these Excel files in an iterative loop and stage the data reads to a staging area which can be easily accomplished using a single package. But in the case of a data warehouse load, there would be generally more than one extract package which would extract data from some OLTP (On Line Transactional Processing) system it would make more sense to utilize multiple packages.
3 - Staging Strategy
Many ETL cycles are developed for a mid-sized business system which would have a delta of a few thousand records only. For such a system a permanent staging area might not be implemented, in which case the staging would be done in some temporary form. This also determines the extract package development strategy.
In such systems, the delta is extracted from transactional or detail level tables, but Master Tables would require a full extract. Master tables do not change that frequently and for deriving the attributes of dimensions, most of the time data is required from these tables. So a full inventory of master tables should be created and these master tables should be either kept in a permanent staging area (if one exists) and can be updated every time, in case a delta is detected. Or another way to manage this, is to get a full extract every time of all master tables if a permanent staging area is not maintained.
4 - Staging Destination
When data is extracted from the source system, the same needs to be staged from where Load and Transform packages would pick up this data. If the staging strategy includes a permanent staging area, the staging destination generally remains SQL Server. But if the staging goes to temporary storage there are multiple options to choose from such as: like Text Files, CSV Files, SQL Server and RAW files. Mostly the selection remains either RAW files and SQL Sever and both have its own set of advantages and limitations.
I suggest going with SQL Server if this option is available without incurring a heavy cost. Otherwise, raw files are a good option apart from the limitation that it makes staging data limited to be read by SSIS only.
5 - Master Package Integration
When there is more than one extract package, they need to be executed and controlled thru some mechanism. This can be facilitated using either a SQL Server Agent Job or developing a master package that executes all of the child packages.
There are multiple benefits of implementing a master package, which are as follows:
Master package can read configuration values and pass on the values to respective child extract packages, which would centralize changes.
As execution of all child packages is done from this package, it is easier to log execution statistics and control execution precedence of each package programmatically.
Master packages open a plug-in interface for other driver packages to execute child packages programmatically, for example the main driver package which would run the entire refresh cycle can easily plug-in to drive this package.
6 - Delta Configuration
Extract Packages are normally required to read the delta from the OLTP system and stage is at the staging area. After each delta is extracted, all the details regarding the last extract are recorded or updated. This becomes the effective configuration for the next delta read. Maximum care should be taken to record these details as the entire delta would be dependent on these details.
If a Master package is used to drive extract packages, this package should read the delta settings and pass these values on to respective child packages, so that the precise and uniform delta is extracted from all the data units.
Details regarding the delta extract are generally stored in some metadata repository which can be even as simple as a configuration collection table in SQL Server. After all the extract packages extract the delta, the master package can update these settings as the last step in the package.
Although this is not a detailed step by step process for setting up an ETL process, it should give you some things to consider before beginning your project.
It is much easier to plan and implement things a certain way instead of having to go back and retrofit everything once you have already started with a different strategy. So take the time to think about these items and how they would work for your enviornment.