Infrastructure Planning for a SQL Server Data Warehouse

By: Dinesh Asanka | Comments (2) | Related: More > Database Administration

Problem

A SQL Server data warehouse has its own characteristics and behavioral properties which makes a data warehouse unique. In this sense, a data warehouse infrastructure needs to be planned differently to that of a standard SQL Server OLTP database system. In this tip we look at some things you should think about when planning for a data warehouse.

Solution

SQL Server Data Warehouse System Parameters

A data warehouse itself has its own parameters, so each data warehouse system has its own unique features. When deciding on infrastructure for the data warehouse system, it is essential to evaluate many parameters. Out of these parameters, the main parameters are Data Volume, Reporting Complexity, Users, System Availability and ETL.

Data Volume

As you may know, Volume is one of the seven properties of big data. Unlike in transactional systems, data warehouse systems tend to store historical data as well as data with multiple domains and systems. This means that the volume of the data in the data warehouse will be large and increasing rapidly.

Reporting Complexities

In cases of data warehousing, there are four types of reporting: Descriptive, Diagnostic, Predictive and Prescriptive. A data warehouse is the framework for analytics, which means that reporting users should have the option of executing ad-hoc queries. Also, there are reports that will use a high number of tables with different types of joins and a high number of aggregations.

Typically, a data warehouse solution must support a combination of the following types of query:

Simple: Relatively straight forward SELECT queries with a single fact table and few dimension tables.
Medium: Repeatedly executed queries that include aggregations or many joins.
Complex: Ad-hoc queries with complex aggregations, joins, and calculations. Also, this category of queries contains data mining and predictive analytics.

Number of Users

Typically, a data warehouse has a smaller number of users than transactional systems. However, since large queries are executed for analytical purposes over a substantial time period, concurrency is a concern.

Availability

Sometimes, depending on the geography distribution of data warehouse users, there is a need to have operating system time slots. Also, planned down time and unplanned outages can affect Availability.

ETL

ETL (Extract-Transformation-Load) is an essential component of the data warehouse. For some data warehouses, daily ETL is adequate. Actually, the majority of data warehouses ETL falls into this category. There are some data warehouses which have a couple of ETL jobs during the day and other ETL jobs will be executed during off-peak hours. There are a few cases where some data warehouses need real time data.

As you can see from these parameters, a data warehouse system can be combinations of multiple complexities of these parameters. Hence it is difficult to judge which category the data warehouse falls into.

The following table consists these multiple parameters with a different scale of the system.

Parameter \ Scale	Small	Medium	Large
Data Volume	Less than 1 TB	1 to 10 TB	More than 10 TB
Reporting Complexity	Simple – 60 % Medium – 30 % Complex – 10 %	Simple – 50 % Medium – 40 % Complex – 10 %	Simple – 20 % Medium – 50 % Complex – 30 %
Number of Users	100 Users 10 Concurrent users	1000 Users 100 – 200 concurrent users	1000 concurrent users
Availability	Typical business hours	1-2 hrs of down time	24x7
ETL	One ETL per day	Intra Day ETL	Real Time Data

As it is difficult to chose the scale of the data warehouse, by looking at the above parameters you can get an idea about scaling your data warehouse.

Types of Workloads

After analyzing the capacities of the data warehouse, the next step is to analyze the workloads of the data warehouse. Typical workloads of data warehouse are ETL, Data Model and Reporting.

ETL

Typically, ETL extracts data from transactional systems, heterogeneous sources and transforms them to suit the analytical platform which is the data warehouse. During the extraction phase, there will be an IO and Memory load on the source systems. As you should not interrupt the source system at any cost, proper planning needs to be done on the extraction so that it won't impact the source systems. Transformation typically takes places at the data warehouse end. Since transformations need more calculation power which means CPU consumption will be high along with Memory usage. The loading of data will also require more IO on the data warehouse system. Since data is coming from multiple source, typically network bandwidth could be a concern for network administrators during the ETL process.

Data Model

In most of the technologies, an additional layer on top of the data warehouse is created in order to improve performance of reporting and analytics. For example, in case of SQL Server SSAS Multi-Dimensional cubes, SSAS Tabular and in case of Oracle, Hyperion cubes are available. In this layer, data will be read from the data warehouse and processed into the data model layer. After the ETL, these data models need to be processed in order to keep the data in sync. In this model layer, aggregated data will be stored, hence processing of data models are high CPU and IO operations. Also, aggregations are memory intensive operations.

Reports and Analytics

Reports and Analytics are the endpoints for the end users. In case of reports, more chances are that the reports will gather large volumes of data. In case Reports are consuming the data model, concerns will be on the reporting server end. In case of Analytics, if data mining algorithms are used, high CPU will be consumed as data mining algorithms consume CPU.

Also, there options such as data driven subscriptions and the standard subscriptions in the reporting platform especially in the case of SQL Server Reporting Services (SSRS). Since reports are writing to disk such as Word, Excel or PDF files, there could be quite a high percentage of IO utilized.

Maintenance Workloads

Apart from typical operations on the data warehouse platform, there are other maintenance tasks that needs to be done.

Indexes Rebuild

Indexes are used for better performance of data retrieval. Since there are less writes to the data warehouse, administrators have the option of creating many indexes. Also, in case of data warehousing, columnstore indexes can be created. When these indexes are present, it requires indexes to be rebuilt in order to avoid index fragmentation and improve overall performance. As said before, there can be large number of indexes in the data warehouse with large volumes of data, so when rebuilding indexes the process could consume a large amount of CPU and IO.

Backups

Data backups are not essential as the data is usually generated from other source systems. However, it is a good idea to backup the data warehouse as it can be helpful to recover if needed rather than rebuilding everything from the scratch. Since a data warehouse generally has a large volume of data, backups can use a lot of CPU and IO on the system.

Next Steps

For further reading, please visit the following links:

About the author

Dinesh Asanka is a 10 time Data Platform MVP and frequent speaker at local and international conferences with more than 12 years of database experience.

This author pledges the content of this article is based on professional experience and not AI generated.

View all my tips

Thursday, September 9, 2021 - 11:49:27 AM - Michael	Back To Top (89220)
Good article, thanks, Dinesh! There are a few points that would be a good addition to your article. for example, types of DWH schema, business requirements, how to build reports and many others. You can find all of them in this article: https://skyvia.com/blog/sql-server-data-warehouse-the-easy-and-practical-guide