Infrastructure Planning for a SQL Server Data Warehouse
By: Dinesh Asanka | Comments (2) | Related: More > Database Administration
A SQL Server data warehouse has its own characteristics and behavioral properties which makes a data warehouse unique. In this sense, a data warehouse infrastructure needs to be planned differently to that of a standard SQL Server OLTP database system. In this tip we look at some things you should think about when planning for a data warehouse.
SQL Server Data Warehouse System Parameters
A data warehouse itself has its own parameters, so each data warehouse system has its own unique features. When deciding on infrastructure for the data warehouse system, it is essential to evaluate many parameters. Out of these parameters, the main parameters are Data Volume, Reporting Complexity, Users, System Availability and ETL.
As you may know, Volume is one of the seven properties of big data. Unlike in transactional systems, data warehouse systems tend to store historical data as well as data with multiple domains and systems. This means that the volume of the data in the data warehouse will be large and increasing rapidly.
In cases of data warehousing, there are four types of reporting: Descriptive, Diagnostic, Predictive and Prescriptive. A data warehouse is the framework for analytics, which means that reporting users should have the option of executing ad-hoc queries. Also, there are reports that will use a high number of tables with different types of joins and a high number of aggregations.
Typically, a data warehouse solution must support a combination of the following types of query:
- Simple: Relatively straight forward SELECT queries with a single fact table and few dimension tables.
- Medium: Repeatedly executed queries that include aggregations or many joins.
- Complex: Ad-hoc queries with complex aggregations, joins, and calculations. Also, this category of queries contains data mining and predictive analytics.
Number of Users
Typically, a data warehouse has a smaller number of users than transactional systems. However, since large queries are executed for analytical purposes over a substantial time period, concurrency is a concern.
Sometimes, depending on the geography distribution of data warehouse users, there is a need to have operating system time slots. Also, planned down time and unplanned outages can affect Availability.
ETL (Extract-Transformation-Load) is an essential component of the data warehouse. For some data warehouses, daily ETL is adequate. Actually, the majority of data warehouses ETL falls into this category. There are some data warehouses which have a couple of ETL jobs during the day and other ETL jobs will be executed during off-peak hours. There are a few cases where some data warehouses need real time data.
As you can see from these parameters, a data warehouse system can be combinations of multiple complexities of these parameters. Hence it is difficult to judge which category the data warehouse falls into.
The following table consists these multiple parameters with a different scale of the system.
|Parameter \ Scale||Small||Medium||Large|
|Data Volume||Less than 1 TB||1 to 10 TB||More than 10 TB|
|Reporting Complexity||Simple – 60 % Medium – 30 % Complex – 10 %||Simple – 50 % Medium – 40 % Complex – 10 %||Simple – 20 % Medium – 50 % Complex – 30 %|
|Number of Users||100 Users 10 Concurrent users||1000 Users 100 – 200 concurrent users||1000 concurrent users|
|Availability||Typical business hours||1-2 hrs of down time||24x7|
|ETL||One ETL per day||Intra Day ETL||Real Time Data|
As it is difficult to chose the scale of the data warehouse, by looking at the above parameters you can get an idea about scaling your data warehouse.
Types of Workloads
After analyzing the capacities of the data warehouse, the next step is to analyze the workloads of the data warehouse. Typical workloads of data warehouse are ETL, Data Model and Reporting.
Typically, ETL extracts data from transactional systems, heterogeneous sources and transforms them to suit the analytical platform which is the data warehouse. During the extraction phase, there will be an IO and Memory load on the source systems. As you should not interrupt the source system at any cost, proper planning needs to be done on the extraction so that it won't impact the source systems. Transformation typically takes places at the data warehouse end. Since transformations need more calculation power which means CPU consumption will be high along with Memory usage. The loading of data will also require more IO on the data warehouse system. Since data is coming from multiple source, typically network bandwidth could be a concern for network administrators during the ETL process.
In most of the technologies, an additional layer on top of the data warehouse is created in order to improve performance of reporting and analytics. For example, in case of SQL Server SSAS Multi-Dimensional cubes, SSAS Tabular and in case of Oracle, Hyperion cubes are available. In this layer, data will be read from the data warehouse and processed into the data model layer. After the ETL, these data models need to be processed in order to keep the data in sync. In this model layer, aggregated data will be stored, hence processing of data models are high CPU and IO operations. Also, aggregations are memory intensive operations.
Reports and Analytics
Reports and Analytics are the endpoints for the end users. In case of reports, more chances are that the reports will gather large volumes of data. In case Reports are consuming the data model, concerns will be on the reporting server end. In case of Analytics, if data mining algorithms are used, high CPU will be consumed as data mining algorithms consume CPU.
Also, there options such as data driven subscriptions and the standard subscriptions in the reporting platform especially in the case of SQL Server Reporting Services (SSRS). Since reports are writing to disk such as Word, Excel or PDF files, there could be quite a high percentage of IO utilized.
Apart from typical operations on the data warehouse platform, there are other maintenance tasks that needs to be done.
Indexes are used for better performance of data retrieval. Since there are less writes to the data warehouse, administrators have the option of creating many indexes. Also, in case of data warehousing, columnstore indexes can be created. When these indexes are present, it requires indexes to be rebuilt in order to avoid index fragmentation and improve overall performance. As said before, there can be large number of indexes in the data warehouse with large volumes of data, so when rebuilding indexes the process could consume a large amount of CPU and IO.
Data backups are not essential as the data is usually generated from other source systems. However, it is a good idea to backup the data warehouse as it can be helpful to recover if needed rather than rebuilding everything from the scratch. Since a data warehouse generally has a large volume of data, backups can use a lot of CPU and IO on the system.
- For further reading, please visit the following links:
- SQL Server Analysis Services Tutorial
- SQL Server Integration Services Tutorial
- SQL Server Reporting Services Tutorial
- SQL Server Business Intelligence Tips
About the author
View all my tips