Learn about Data Warehousing for Analytical Needs
How can an organization analyze its heterogeneous and normalized data in real-time for decision making purposes?
Every mid-scale to enterprise-level organization has data in many operational systems built over the years with different technologies. These operating systems are built to target one or a few operational processes. However, when analyzing these data is needed, organizations find many difficulties, including technical and business. Further, databases are designed to store short-term data while analysis needs short time and, but primarily, long-term data. In addition, most of the transaction and operational systems are designed for highly normalized data that will lead to negative performance impact when you want to analyze the data as it requires a large number of tables to join.
A data warehouse is one of the solutions to facilitate the above said problems. A data warehouse is a collection of comprehensive technologies such as ETL tools for data integration from the data sources, data storage, data staging, reporting, cubes, dashboards, etc. It consists of an Enterprise-wide data analysis framework with access to any level of information to the key people who make the strategic decisions in the organization. When designing a data warehouse, it is important to note that it is designed and the framework heavily depends on the domain you are implementing. For example, the decisions you make to implement a data warehouse for a supermarket are totally different from implementing a teaching institute.
The above data warehouse architecture diagram shows that the data warehouse extracts from multiple data sources, including relational databases, files, web services, images, and videos. Most data warehouses focus on extracting data from standard data such as ERP and HR systems. However, you can enhance the data warehouse's analytical capabilities by extracting social media data and loading it into the data warehouse to increase your business intelligence capabilities.
Designing and bringing data to the data warehouse is one of the most challenging tasks. Once data is in the warehouse, reporting and analysis tasks are much easier. If you wish to perform data mining or predictive tasks, connecting to the data warehouse is better than running them directly on the source systems.
Why Data Warehouses are Essential in Today's World
Just for a minute, imagine that you are running a report from the source systems without a data warehouse. First, you may have to execute queries connected to multiple sources that require distributed transactions. Considering the system's security, operational owners may not allow you to run distributed transactions. Apart from the security concern, distributive queries will add additional overhead to the source systems. As your paramount task is to operate your operational systems smoothly, distributive queries are not entertained.
Further, some databases may not support distributed queries, such as NOSQL databases. In those database systems, you cannot execute distributed queries.
Apart from the distributive queries, operational databases are designed with normalized structures, as operational systems have an equal number of writes and reads. Due to the normalization of table structures, you may have to join multiple tables to achieve some reports, as shown in the following query.
SELECT Production.ProductCategory.Name, YEAR(Sales.SalesOrderHeader.OrderDate) YEAR, SUM(Sales.SalesOrderDetail.LineTotal) Amount FROM Sales.SalesOrderDetail INNER JOIN Sales.SalesOrderHeader ON Sales.SalesOrderDetail.SalesOrderID = Sales.SalesOrderHeader.SalesOrderID INNER JOIN Production.Product ON Sales.SalesOrderDetail.ProductID = Production.Product.ProductID INNER JOIN Production.ProductSubcategory ON Production.Product.ProductSubcategoryID = Production.ProductSubcategory.ProductSubcategoryID AND Production.Product.ProductSubcategoryID = Production.ProductSubcategory.ProductSubcategoryID INNER JOIN Production.ProductCategory ON Production.ProductSubcategory.ProductCategoryID = Production.ProductCategory.ProductCategoryID AND Production.ProductSubcategory.ProductCategoryID = Production.ProductCategory.ProductCategoryID GROUP BY Production.ProductCategory.Name, YEAR(Sales.SalesOrderHeader.OrderDate)
However, as the data warehouses mainly cater to business end users, it is difficult to imagine that they will execute this type of query. In data warehouse design, denormalized or relaxed normalizations are preferred as the data warehouse has more reads. The above query can be rewritten in data warehousing to achieve the same results.
SELECT DimDate.CalendarYear, DimProduct.EnglishProductName, SuM(FactInternetSales.SalesAmount) Amount FROM FactInternetSales INNER JOIN DimDate ON FactInternetSales.OrderDateKey = DimDate.DateKey INNER JOIN DimProduct ON FactInternetSales.ProductKey = DimProduct.ProductKey GROUP BY DimDate.CalendarYear, DimProduct.EnglishProductName
Predictive Analytics on Data Warehouse
More and more organizations are looking at the options of predictive and prescriptive analytics. Since data warehouses have more clean and organized data, it will be much preferred to model predictive models on the data warehouse. Further, predictive analytics need a large volume of historical data. Typically, operational systems store short-term data. Hence predictive analytics will not be able to execute on the operational systems.
Differences Between a Data Warehouse and Operational Systems
As most users are more familiar with operational systems, it is worth comparing and contrasting a data warehouse's technical and business features with an operational system.
|Type of Difference||Operational System||Data Warehouse|
|Age of data||Current with limited historical||Historical|
|Consumption||Transactional and structured||Ad hoc queries and reporting|
|Database Design||Mostly Normalized||Denormalization or relaxed normalization|
|Data Model||Relational (OLTP)||Multi-Dimensional (OLAP)|
|Number of Users||Many||Few|
|Transactional Processing||Primary data from transactions||Transformed secondary data with limited derived data|
|Data Ownership||Data owned by Transactional systems||Only a consumer with limited ownership|
|High Availability and Disaster Recovery||Extremely needed as data is generated in the transactional systems||Not essential, but nice to have|
Considering the above differences between the data warehouse and the operational system, it is safe to declare that the data warehouse needs a separate environment rather than using the same operational system for data analytics.
What are the Microsoft Solutions for an Enterprise Data Warehouse (EDW)?
Microsoft is one of the primary organizations providing data warehouse solutions. Since a data warehouse has comprehensive technologies, many components will cover the requirements of a data warehouse, as shown in the following table.
|Component||On-Premises Solution||Cloud Data Warehouse Solution|
|Extract-Transform-Load||SQL Server Integration Services (SSIS)||Azure Data Factory (ADF)|
|Storage||SQL Server Relational Database Engine||Azure DW Azure DB Azure Data Lake (ADL)|
|Cube||SQL Server Analysis Service – Tabular SQL Server Analysis Service – MDM||Azure Analysis Services|
|Visualization||SQL Server Reporting Services||Power BI|
Sample Data Warehouse
Microsoft has a sample data warehouse for developers to identify the features of a data warehouse. Microsoft has provided mainly two databases: AdventureWorks and AdventureWorksDW. You can download those databases from AdventureWorks sample databases - SQL Server | Microsoft Learn.
- Examine the structure of the Microsoft sample data warehouse and identify the differences between operational and data warehouse structures.
About the author
View all my tips
Article Last Updated: 2023-08-09