Common Data Warehouse Development Challenges
Data warehouses have gone through a long evolution in the last few decades. Today's data warehouses are much larger than their predecessors, with RDBMS and Data Lake components. However, regardless of the size or technology platforms, they're based on, there're some well-known common challenges and development principles, that apply to most data warehouses. This tip is part of the series dedicated to the building of the new generation of data warehouses, called lakehouses (see Building Scalable Lakehouse Solutions using Azure Synapse Analytics), and we're going to discuss here common data warehouse building challenges.
Data warehousing modelling approach-the Kimball model
There're many data warehousing modelling approaches, but the Kimball model is probably the most popular. This model was introduced by Ralph Kimball at the end of the last century and has quickly become an industry standard. There are several modelling rules at the core of this approach, designed to build high-quality and efficient data warehouses. We'll be focusing on few rules, outlined below:
- Dimension denormalization. Denormalization means consolidation of the related dimensions, to achieve better performance. For example, if your transactional model contains Product and Product Category tables, it makes sense to merge them under Product dimension, when you design a data warehouse model because the denormalization allows to eliminate unnecessary table joins (which can be costly on large tables) and create efficient, intuitive, and easier to understand data models. The goal of the denormalization is to build a star-like schema, with the fact tables in the center, linked to dimensional tables. Of course, achieving star-like schema may not always be possible, in which cases a snowflake model with fewer dimensions is recommended.
- Classification of dimensions. The Kimball model classifies the dimension tables into few categories, based on the need to preserve the change history (this is also called Slowly Changing Dimensions or SCD). Although Kimball's full classification includes seven SCD types, we'll be focusing here on the two most common types, listed below:
Type 1. This approach requires overwriting the older version when the new row version arrives. This approach makes sense for static tables in which history is not important. For example, if you have a City dimension with the state, country details, it's reasonable to expect these details will never change.
Type 2. This approach requires adding new versions of the dimensions, to preserve
historical changes. For example, if you have a Product dimension the product color,
packaging size and category, some of them may change over time and businesses may
be interested in tracking changes to each product. A common design approach for
such dimension tables is adding a surrogate key, date fields representing the start/end
times, as well as a field indicating the active/expired status of each dimension
version. Here's a quick example. Say, we have the following row dimension
in the source database for the Product dimension:
|Product Id||Name||Package size|
We could preserve the change history on this dimension, by creating the following structure:
|SurrKey||Product Id||Name||Package size||EffectiveFromDate||EffectiveToDate||IsActive|
- Building time dimensions. Facts often contain date/time fields and Kimball recommends creating time dimensions, based on this data, and replacing date/time fields with the pointers to this table.
- Classification of facts. Kimball classifies fact tables into three categories (see here for more details):
- Transaction fact tables. This type represents non-periodic facts (i.e. the facts that are not tied to specific periods). An example of this type could be a sales table.
- Periodic snapshot fact tables. This type represents the periodic facts captured at specific periods. An example of this could be a daily inventory table.
- Accumulating snapshot tables. This type represents the facts that capture different stages of the complex process. An example of this could be an order processing table, which allows tracking the statuses of the different stages of the order, across the manufacturing process.
Although we'll be focusing here on the Kimball model, most of this discussion would still apply to other models as well.
Common data warehouse dimensional modelling tasks
Now, let's discuss the development tasks required to address the abovementioned rules.
Dimension building is probably one of the most difficult parts of data warehouse implementation. While type-1 SCD dimensions are easier to implement, the type-2 SCD dimension tables require the generation of unique surrogate keys, time lifespan fields, etc. The common approach for generating surrogate keys in the past, when data warehouses were based on RDBMS databases, was to use the auto-increment column feature. However, this approach does not work for Delta Lake tables, due to their distributed nature. Fortunately, many modern Big Data transformation platforms have functionalities to generate unique keys programmatically.
The generation of the time dimension is another challenge. To address it, the ETL pipelines need to extract timestamp data from the fact tables, store it in the time dimension and replace related time attribute values within the fact tables with the corresponding time dimension keys.
One of the common fact-related building challenges is to populate fact tables linked to type-2 SCD dimensions. Because these dimensions require surrogate keys, the recommended approach is to replace business keys coming from the source with the corresponding surrogate keys within the fact tables.
Another common fact processing challenge is handling the late-arriving dimensions. In a perfect world, all the dimension attributes are known at the time of the fact table processing. However, in a real-life scenario, some fact rows arrive with the dimension keys that do not match any existing dimension keys (in other words, none of the dimension details, except business keys, are known at the time of processing). There're few common approaches for handling these cases:
- Exclusive approach. Do not allow the facts having mismatches to flow into the target fact tables. These records can be stored in a temporary location for invalid rows and processed later when the matching dimension rows arrive.
- Inclusive approach. Add special dimension rows with the business keys from the source and mark their empty descriptive columns as 'Unknown'. Allow the facts having mismatches into the target fact tables and link them to these special dimensions. With this approach, ETL jobs need to check such special dimensions on further iterations and overwrite them, when the required dimensions details arrive. Although this requires little more development efforts, it's a preferable approach, as it allows more accurate reporting.
Another variation of this problem is related to the facts that arrive with the empty dimension keys. These cases can be handled similar way, with a slight variation - adding a single dummy record with an empty business key into the dimension table and point all fact rows with empty keys to that dimension row.
As you may have guessed, all the above mentioned approaches trying to eliminate orphan facts (i.e. rows with the dimension keys pointing nowhere).
Finally, most large fact tables require the implementation of an incremental upload logic, as it's often unreasonable to do a full upload of the source data. The common solution here is using the watermark tables with the last processed keys or timestamps).
The other Data Warehousing tasks
Apart from the dimensional modelling tasks, a typical data warehouse building project includes the following tasks:
- Creating the staging and dimensional tables.
- Cleansing, de-duplicating and enriching the source data.
- Populating the staging and dimensional tables.
- Capturing the pipeline upload statistics (i.e. number of rows written to the target tables).
Now that we've discussed common data warehouse building challenges, let's see how to address them using Azure Synapse Analytics. In the next few posts, I'll explain how to address challenges discussed here and demonstrate how to create Delta Lake tables and build a Synapse Analytics pipeline to populate them.
- Read: Dimensional Modeling Techniques
- Read: The 10 Essential Rules of Dimensional Modeling
- Read: Star schema
- Read: Late Arriving Dimensions
About the author
View all my tips
Article Last Updated: 2021-10-20