Azure SQL Database - Table Partitioning
The Gartner Group specializes in surveying leading companies and determining industry trends in Information Technology. It is not surprising that cloud computing and big data (information explosion) are on top of the 2015 technology trending list.
In December of 2014, Microsoft release the preview version of Azure SQL Database update V12. The main purpose of this version is to narrow the syntactical differences between the on-premises and in-cloud database engines. The hope is that more companies will migrate their data to this software as a service platform.
Given these trends, the main question a database administrator might have is "How can I manage larger tables in Azure SQL Database?".
The new version of Azure SQL database has introduced table partitioning. This feature is part of the enterprise only version for on premises, but is available in all versions in the cloud.
To demonstrate this new feature we need to have a fictitious business problem. Since one of my majors in college was applied mathematics, I am going to solve a math problem.
Calculate and store the primes numbers from 1 to 1 million with ten data partitions. Thus, the primes numbers will be hashed in buckets at every one hundred thousand mark.
The trial division algorithm that I am going to introduce is a brute force method for calculating prime numbers. It is great for comparing the computing power of two machines by looking at overall execution times.
This routine consists of dividing a number n by each integer m which is greater than 1 and less than or equal to the square root of n. If the result of any of these divisions is an integer, then n is not a prime; otherwise, it is a prime.
Creating the database
I am using my MSDN ultimate license which comes with a free $150 per month Azure subscription. This is a great way to learn about what Azure has to offer without any real investment.
This demonstration assumes you have a Azure Database Server already created with a valid login. The server login I created is named jminer. It is important to record the web address of the server (connection string) since this will be used in SSMS. The image below shows the V12 preview has been enabled.
I will be using SQL Server Management Studio (SSMS) 2014 with cumulative update 5 installed to design and deploy the solution. I will be referring to the Azure Portal to review the results of our work.
To connect to our Azure Database server, enter the connection information using SQL Server standard authentication.
One statement that is still not supported is the USE statement. This limitation can be overcome by selecting the correct database in the object explorer and right clicking to open a new query window. I will be leaving this statement in the code since they are a reminder of what database you should be in. Executing this statement in the wrong database generates an error.
To verify the server version and default database, we can use the db_name() and @@version statements.
The code below recreates the MATH database.
/* Create a database to hold the prime numbers */ -- Which database to use. USE [master] GO -- Delete existing database IF EXISTS (SELECT name FROM sys.databases WHERE name = N'MATH') DROP DATABASE MATH GO -- Create new database CREATE DATABASE MATH ( MAXSIZE = 20GB, EDITION = 'STANDARD', SERVICE_OBJECTIVE = 'S2' ) GO
It is interesting to note two new keywords have been introduced to describe database type. I will be investigating this new syntax in my next tip.
Creating the Partition Function and Scheme
The key concept behind any type of horizontal partitioning is to group similar records into a single file group and/or file. In turn, this changes major record operations into file operations. For instance, DELETE all data with partition value Y turns into a remove file operation. Searching for data with partition value Y as part of the WHERE clause directs the storage engine to retrieve data from that one file.
The overall benefits should result in increased speed. However, like most things in life your delta might vary.
The main question that comes to mind is "How do we do create a partition scheme in Azure since we have no control over file placement?
The product team has assured me that mapping ALL the partitions to the PRIMARY file group will be optimized by the storage engine in Azure.
The diagram below is a conceptual view of how table partitioning works for our example in Azure SQL database.
The code below creates a partition function named PF_HASH_BY_VALUE and partition scheme named PS_HASH_BY_VALUE.
/* Use table partitioning */ -- Which database to use. USE [MATH] GO -- Create the partition function CREATE PARTITION FUNCTION PF_HASH_BY_VALUE (BIGINT) AS RANGE LEFT FOR VALUES (100000, 200000, 300000, 400000, 500000, 600000, 700000, 800000, 900000) GO -- Show the function SELECT * FROM sys.partition_functions GO -- Create the partition scheme CREATE PARTITION SCHEME PS_HASH_BY_VALUE AS PARTITION PF_HASH_BY_VALUE ALL TO ([PRIMARY]); GO -- Show the scheme SELECT * FROM sys.partition_schemes GO
The output from querying the system tables is shown below.
A simple call to the $PARTITION system function can be used to test the hash index. The example below uses a derived table with key boundary values.
-- Test partition function SELECT MY_VALUE, $PARTITION.PF_HASH_BY_VALUE(MY_VALUE) AS HASH_IDX FROM ( VALUES (1), (100001), (200001), (300001), (400001), (500001), (600001), (700001), (800001), (900001) ) AS TEST (MY_VALUE); GO
The output from the test is shown below.
Creating the Partitioned Table
The TBL_PRIMES table contains three columns. The first one is the value of the prime number. The second one is how many divisions were tried before the number was declared prime. The third one is the date and time the data was stored. The first column is chosen as the primary key for the table.
The code below creates the new table with the partitioning scheme implemented on the primary key.
/* Create a table to hold the prime numbers */ -- Which database to use. USE [MATH] GO -- Delete existing table IF EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[DBO].[TBL_PRIMES]') AND type in (N'U')) DROP TABLE [DBO].[TBL_PRIMES] GO -- Add new table CREATE TABLE [DBO].[TBL_PRIMES] ( [MY_VALUE] [bigint] NOT NULL, [MY_DIVISION] [bigint] NOT NULL CONSTRAINT [CHK_TBL_PRIMES] CHECK ([MY_DIVISION] - 0), [MY_TIME] [datetime] NOT NULL CONSTRAINT [DF_TBL_PRIMES] DEFAULT (GETDATE()) CONSTRAINT [PK_TBL_PRIMES] PRIMARY KEY CLUSTERED ([MY_VALUE] ASC) ) ON PS_HASH_BY_VALUE ([MY_VALUE]) GO
User defined stored procedures
First, we need a procedure that takes a number as a parameter and determines if it is prime. In this example we will use an old fashion WHILE loop. Some relational algebraic purest might argue that we should use a TALLY table. However, this is only a simple example focused on table partitioning.
The code below creates the procedure named SP_IS_PRIME.
/* Create a procedure to determine if number is prime */ -- Which database to use. USE [MATH] GO -- Delete existing procedure IF EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[SP_IS_PRIME]') AND type in (N'P', N'PC')) DROP PROCEDURE [dbo].[SP_IS_PRIME] GO -- Create the stored procedure from scratch CREATE PROCEDURE [dbo].[SP_IS_PRIME] @VAR_NUM2 BIGINT AS BEGIN -- NO DISPLAY SET NOCOUNT ON -- LOCAL VARIABLES DECLARE @VAR_CNT2 BIGINT; DECLARE @VAR_MAX2 BIGINT; -- NOT A PRIME NUMBER IF (@VAR_NUM2 = 1) RETURN 0; -- A PRIME NUMBER IF (@VAR_NUM2 = 2) RETURN 1; -- SET UP COUNTERS SELECT @VAR_CNT2 = 2; SELECT @VAR_MAX2 = SQRT(@VAR_NUM2) + 1; -- TRIAL DIVISION 2 TO SQRT(X) WHILE (@VAR_CNT2 <= @VAR_MAX2) BEGIN -- NOT A PRIME NUMBER IF (@VAR_NUM2 % @VAR_CNT2) = 0 RETURN 0; -- INCREMENT COUNTER SELECT @VAR_CNT2 = @VAR_CNT2 + 1; END; -- A PRIME NUMBER RETURN 1; END GO
Second, we need a procedure that takes a starting and ending value as input and calculates and stores primes numbers between those two values as output. This procedure will allow us to run multiple calls in parallel against Azure SQL Database at the same time.
The code below creates the procedure named SP_STORE_PRIMES.
/* Create a procedure to store primes from x to y. */ -- Which database to use. USE [MATH] GO -- Delete existing procedure IF EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[SP_STORE_PRIMES]') AND type in (N'P', N'PC')) DROP PROCEDURE [dbo].[SP_STORE_PRIMES] GO -- Create the stored procedure from scratch CREATE PROCEDURE SP_STORE_PRIMES @VAR_ALPHA BIGINT, @VAR_OMEGA BIGINT AS BEGIN -- NO DISPLAY SET NOCOUNT ON -- DECLARE VARIABLES DECLARE @VAR_CNT1 BIGINT; DECLARE @VAR_RET1 INT; -- SET VARIABLES SELECT @VAR_RET1 = 0; SELECT @VAR_CNT1 = @VAR_ALPHA; -- CHECK EACH NUMBER FOR PRIMENESS WHILE (@VAR_CNT1 <= @VAR_OMEGA) BEGIN -- ARE WE PRIME? EXEC @VAR_RET1 = DBO.SP_IS_PRIME @VAR_CNT1; -- FOUND A PRIME IF (@VAR_RET1 = 1) INSERT INTO [DBO].[TBL_PRIMES] (MY_VALUE, MY_DIVISION) VALUES (@VAR_CNT1, SQRT(@VAR_CNT1)); -- INCREMENT COUNTER SELECT @VAR_CNT1 = @VAR_CNT1 + 1 END; END GO
Many of the SQL Server tools that come with the on-premises edition work the same way for the cloud edition. I am going to leverage the SQLCMD utility in a batch program. The command line interpreter has the start keyword that can be used to launch a program asynchronously. Putting all this concepts together with the right calls to SP_STORE_PRIMES, we can calculate the prime numbers in ten even batches.
The command file below calls our user defined stored procedure to solve our business problem.
REM REM Calculate primes numbers <= 1M asynchronously. REM REM [Partition 1] start cmd /c sqlcmd -S codf58h5ey.database.windows.net,1433 -U jminer -P SQLtip$2015 -d MATH -Q "EXEC SP_STORE_PRIMES 1, 100000;" REM [Partition 2] start cmd /c sqlcmd -S codf58h5ey.database.windows.net,1433 -U jminer -P SQLtip$2015 -d MATH -Q "EXEC SP_STORE_PRIMES 100001, 200000;" REM [Partition 3] start cmd /c sqlcmd -S codf58h5ey.database.windows.net,1433 -U jminer -P SQLtip$2015 -d MATH -Q "EXEC SP_STORE_PRIMES 200001, 300000;" REM [Partition 4] start cmd /c sqlcmd -S codf58h5ey.database.windows.net,1433 -U jminer -P SQLtip$2015 -d MATH -Q "EXEC SP_STORE_PRIMES 300001, 400000;" REM [Partition 5] start cmd /c sqlcmd -S codf58h5ey.database.windows.net,1433 -U jminer -P SQLtip$2015 -d MATH -Q "EXEC SP_STORE_PRIMES 400001, 500000;" REM [Partition 6] start cmd /c sqlcmd -S codf58h5ey.database.windows.net,1433 -U jminer -P SQLtip$2015 -d MATH -Q "EXEC SP_STORE_PRIMES 500001, 600000;" REM [Partition 7] start cmd /c sqlcmd -S codf58h5ey.database.windows.net,1433 -U jminer -P SQLtip$2015 -d MATH -Q "EXEC SP_STORE_PRIMES 600001, 700000;" REM [Partition 8] start cmd /c sqlcmd -S codf58h5ey.database.windows.net,1433 -U jminer -P SQLtip$2015 -d MATH -Q "EXEC SP_STORE_PRIMES 700001, 800000;" REM [Partition 9] start cmd /c sqlcmd -S codf58h5ey.database.windows.net,1433 -U jminer -P SQLtip$2015 -d MATH -Q "EXEC SP_STORE_PRIMES 800001, 900000;" REM [Partition 10] start cmd /c sqlcmd -S codf58h5ey.database.windows.net,1433 -U jminer -P SQLtip$2015 -d MATH -Q "EXEC SP_STORE_PRIMES 900001, 1000000;"
Even though I have been in the IT industry for a quarter century, I still test and re-test my solutions to make sure that my algorithms work correctly for both positive and negative testing.
One question that a tester might have is "How do I know the data was stored in the correct partition?"
The Azure preview V12 has exposed over 100 new dynamic management views that the database administrator can use for monitoring and troubleshooting. The sys.dm_db_partition_stats view can be used to answer such a question. However, I already introduced the $PARTITION system function that can obtain the same answer.
The code below shows how to investigate row counts by partition number.
/* Validate data placement */ -- Use dmv to get partitions SELECT Partition_Number, Row_Count FROM sys.dm_db_partition_stats WHERE object_id = object_id('TBL_PRIMES'); -- Using the $PARTITION function SELECT $PARTITION.PF_HASH_BY_VALUE([MY_VALUE]) as Partition_Number, COUNT(*) as Row_Count FROM MATH.[dbo].[TBL_PRIMES] GROUP BY $PARTITION.PF_HASH_BY_VALUE([MY_VALUE]);
The output from the test is shown below.
- Azure V12 has introduced edition and service objective to provide different levels of processing power.
How do these offerings compare to a on-premises edition?
- Many times table partition goes hand in hand with a sliding window for data warehousing. The current solution for prime numbers less than or equal to 1 million is static and
does not account for adding and subtracting partitions.
How do you implement such a concept?
- There are many other techniques that can be used to perform horizontal partitioning. One technique is database sharding.
Microsoft has released the Azure SQL database elastic scale library for .NET which is a sharding solution.
How can we take advantage of the library for our own business problems?
Last Updated: 2015-01-29
About the author
View all my tips