Partitioning SQL Server Data for Query Performance Benefits
I spend a lot of time interfacing with SQL engineers of all types (developers, DBAs, architects, ETL engineers, etc.), and something I hear very frequently is "I want to horizontally partition my data to improve my query response times to customers". That statement usually prompts follow-up questions of my own, such as "what types of queries are you trying to speed up by partitioning your data?" or "what systems are you thinking of employing this type of strategy on?". Too often the responses send us down a path of discussing why partitioning data isn't a viable strategy for improving many types of queries/workloads (such as typical OLTP workloads and seeks of data - particularly seeks that don't filter/join on partitioning keys). Sometimes engineers and others are downright shocked to see some of the results of different query types against partitioned data - usually, by the end of a half-hour demo, discussion, and walk through some scripts (which I've attached as well), everything makes good sense and there's a good understanding of why things work the way they do. It's a frequent misconception that using the horizontal partitioning features built into the engine in SQL 2005 will lead to large benefits for OLTP-like workloads/queries (i.e. seeks of data, singleton lookups, etc.), when the reality is that horizontal partitioning is really more about improving performance for things like administrative tasks, bulk data loads/switches, and scans of data where the scanning can be limited by the partition ranges.
This post is not about switching data with partitions, rolling window scenarios, administrative tasks with partitioning, or bulk loading/moving data with partitioning - there is plenty of that in other places (but if someone would like to see some examples, by all means comment or contact me via email and I'll be happy to oblige). Instead, for the remainder of this post we are going to walk through query execution differences for a range of access methods (i.e. seeks, scans, singleton lookups, etc.) against a partitioned structure and a non-partitioned structure that are exactly the same, with the exception that one has all of its indexes partitioned and the other is not partitioned at all.
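The DDL below references a partition scheme named PartitionPSFD that the attached script creates beforehand. Since those statements aren't reproduced in this post, here is a sketch of what they might look like - the function name PartitionPFFD and the boundary values are my assumptions; only the scheme name PartitionPSFD appears in the DDL that follows:

```sql
-- Assumption: a RANGE RIGHT function on char(5) with 6 boundary points,
-- which yields the 7 partitions the data is later loaded into.
create partition function PartitionPFFD (char(5))
as range right for values ('aaa01', 'bbb01', 'ccc01', 'ddd01', 'eee01', 'fff01');

-- Map all 7 partitions to the PRIMARY filegroup for simplicity.
create partition scheme PartitionPSFD
as partition PartitionPFFD all to ([PRIMARY]);
```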
The schema for the partitioned table is as follows:
-- Create a sample table that will mimic the main table...
create table dbo.Fund_Dtl (
    Fund_Dtl_ID int identity(1,1) not null,
    fillerColumn char(150) not null,
    Fund_Summary_ID int not null,
    Txn_Dt datetime not null,
    Partition_Column char(5) not null default ('xyz')
) on [PartitionPSFD] ([Partition_Column]);

create clustered index Fund_Dtl_Fund_Summary_ID_CI on dbo.Fund_Dtl (
    Fund_Summary_ID, Partition_Column
) on [PartitionPSFD] ([Partition_Column]);

alter table dbo.Fund_Dtl with check add constraint Fund_Dtl_pk primary key nonclustered
    (Fund_Dtl_ID, Partition_Column) on [PartitionPSFD] ([Partition_Column]);

create nonclustered index ix_TxnDt on dbo.Fund_Dtl (
    Txn_Dt
) include (Partition_Column)
on [PartitionPSFD] ([Partition_Column]);
As you can see, we include a clustered index and 2 nonclustered indexes (1 of which is the primary key) - the clustered index includes the partitioning column as the 2nd key column of the index, as does the primary key. Notice, however, that the 2nd nonclustered index doesn't include the partitioning column as a key column of the index, but only as a covered column (i.e. an included column). The script then proceeds to load 602,112 records into the partitioned table, filling the Partition_Column column with 7 distinct values (7 partitions), each with 86,016 records.
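One way to confirm a load like this landed where you'd expect is the $PARTITION function, which maps a value through the partition function to its partition number. A quick check along these lines would work (PartitionPFFD is my assumed partition function name, not something shown in the post):

```sql
-- Count rows per partition; with the load described above, each of the
-- 7 partitions should report 86,016 rows.
select $partition.PartitionPFFD(Partition_Column) as partition_number,
       count(*) as row_count
from dbo.Fund_Dtl
group by $partition.PartitionPFFD(Partition_Column)
order by partition_number;
```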
Ok, then we create the non-partitioned table as follows:
-- Create another table from the data in partitioned table that isn't partitioned...
select * into dbo.Fund_Dtl_NoPartition from dbo.Fund_Dtl;

-- Index it just like the main table, with no partitioning...
create clustered index c2 on dbo.Fund_Dtl_NoPartition (
    Fund_Summary_ID, Partition_Column
);

alter table dbo.Fund_Dtl_NoPartition with check add constraint Fund_Dtl_np_pk primary key nonclustered
    (Fund_Dtl_ID, Partition_Column);

create nonclustered index ix_TxnDt_np on dbo.Fund_Dtl_NoPartition (
    Txn_Dt
) include (Partition_Column);
As you can see, the non-partitioned table is identical to the partitioned table in every way with the exception that it isn't partitioned at all - the columns are the same, the indexes are the same, the data is the same, etc.
Ok, at this point the script runs 6 different tests, each executed identically against the partitioned and non-partitioned tables, commenting on the performance differences on my machine - I'd encourage you to download the attached script and run through it on your own system to become familiar with the setup, results, etc. (it is heavily commented). If anyone has additional tests they'd like to try out, or if anyone sees any discrepancies, by all means, let me know. I've included a brief summary of the tests in the script and the results from my machine below for those who can't run the attached script (or mobile readers :-)). Enjoy!
- Test 1: SEEK (small range scan) of data on the Fund_Dtl_ID column. This query does not filter on the partitioning column in any way, and we iterate 30,000 executions of the query. In this test, the run against the partitioned table finishes consistently in about 3.7 seconds on my machine, performs 8 scans (partition constant scans) and 14 logical reads. The non-partitioned table finishes consistently in about 2.1 seconds (1.6 seconds faster), performs a single scan and 3 logical reads. Obviously this scenario is won by the non-partitioned table.
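A query of roughly this shape would reproduce test #1 - the literal values are my assumptions, not the attached script's exact text:

```sql
-- Seek on the nonclustered PK's leading column with no filter on
-- Partition_Column: the partitioned plan has to probe every partition.
select Fund_Dtl_ID, Partition_Column
from dbo.Fund_Dtl
where Fund_Dtl_ID between 1000 and 1010;
```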
- Test 2: Same as test #1, only we also filter on the partitioning column with an equality against a constant value. The run against the partitioned table improves to 3.4 seconds with a single scan (partition elimination due to the constant equality) and 2 logical reads. The non-partitioned run drops to about the same number, finishing in 3.3 seconds with 3 logical reads and a single scan. The non-partitioned query lost some time due to a residual check of the data for the partitioning column. Call this scenario a tie.
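Test #2's shape adds the constant equality (again, the literal values are assumed for illustration):

```sql
-- The constant filter on Partition_Column lets the optimizer eliminate
-- all but one partition, so only a single partition is probed.
select Fund_Dtl_ID, Partition_Column
from dbo.Fund_Dtl
where Fund_Dtl_ID between 1000 and 1010
  and Partition_Column = 'aaa01';
```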
- Test 3: Same as test #2, only we create 2 additional indexes with the key columns reversed (i.e. Partition_Column then Fund_Dtl_ID). Both runs improve by a second or so in duration, but the results end in a tie again.
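The additional indexes in test #3 would look something like this (the index names are my invention):

```sql
-- Reversed key order: Partition_Column leads, so a constant equality on
-- it plus a range on Fund_Dtl_ID becomes one tight seek on each table.
create nonclustered index ix_Part_FundDtl on dbo.Fund_Dtl
    (Partition_Column, Fund_Dtl_ID)
on [PartitionPSFD] ([Partition_Column]);

create nonclustered index ix_Part_FundDtl_np on dbo.Fund_Dtl_NoPartition
    (Partition_Column, Fund_Dtl_ID);
```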
- Test 4: Same as test #2, only instead of an equality against the partitioning column, we use a non-equality filter (>=). The run against the partitioned table finishes in about 5 seconds with 7 scans and 14 logical reads. The non-partitioned table wins significantly again, finishing in about 3.45 seconds with a single scan and 3 logical reads.
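Test #4 swaps the equality for an inequality; the shape would be roughly (literals assumed):

```sql
-- A >= filter against the partitioning column can't pin the query to a
-- single partition the way a constant equality can, so multiple
-- partitions still get touched.
select Fund_Dtl_ID, Partition_Column
from dbo.Fund_Dtl
where Fund_Dtl_ID between 1000 and 1010
  and Partition_Column >= 'aaa01';
```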
- Test 5: This test introduces a SCAN operation, but with no filtering on the partitioning column - just a standard scan of the clustered index for each table, performing only 30 iterations instead of 30,000. The partitioned table run finishes in just over 2.6 seconds consistently with 8 scans and 14,112 logical reads. The non-partitioned test finishes in nearly the same time (just over 2.5 seconds), performing a single scan and slightly fewer reads at 14,052. Another tie.
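Test #5's scan is an unfiltered read of the clustered index; a sketch of one way to force that (aggregating a column that exists only in the clustered index is my device here, not necessarily the script's):

```sql
-- Full clustered index scan: the partitioned table reads each of its 7
-- partitions in turn, the non-partitioned table reads one b-tree.
select max(fillerColumn)
from dbo.Fund_Dtl;
```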
- Test 6: The final test is again a SCAN operation, this time including an additional equality filter on the partitioning column and extending the iterations from 30 to 75. This is the scenario where we expect partitioning to help immensely with a query, and it doesn't disappoint. The partitioned table test run finishes in well under a second consistently (850 milliseconds or so) with a single scan and 2,016 logical reads. The non-partitioned table tests take just over 6 seconds consistently, performing a single scan and again performing 14,052 logical reads (just like the last test).
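Test #6's shape is the same scan with a constant filter on the partitioning column (literal assumed):

```sql
-- Scan limited by the partition range: the partitioned table touches only
-- the single qualifying partition (roughly 1/7 of the reads), while the
-- non-partitioned table still scans the entire clustered index and
-- filters rows as it goes.
select max(fillerColumn)
from dbo.Fund_Dtl
where Partition_Column = 'aaa01';
```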