After a couple weeks spent enhancing a large stored procedure, named MyDataGet, I'm done with the new version. After considerable work on the performance of the procedure I think it's faster, but it's not faster in every case. How can I demonstrate to skeptical users that by-and-large the new version is faster and they should put it into production?
The right statistics can demonstrate that the new version is "very probably" better than the old one. In this case the statistic that we'll use is the sign test. It's a common statistical test and if you took statistics in college you may have heard of it. It has the advantage that it has few statistical assumptions and that makes it easier to defend.
MyDataGet, is the key stored procedure for an application that I've worked on for a few years. It's the one that my client's production jobs are based on and it does most of the ad hoc querying for the users. They use it instead of querying the data directly because of the need for complex adjustments to the data and for the need to pivot the data into columns that they find easier to read and that their analytical programs usually require.
The data is structured as entity-attribute-value with a table like this:
CREATE TABLE eav ( attribute_id smallint NOT NULL ,[entity_id] int NOT NULL ,[value] varchar(255) NOT NULL ,begin_datetime datetime NOT NULL DEFAULT (GETDATE()) ,end_datetime datetime NOT NULL DEFAULT ('9999-12-31 23:59:59.997') )
MyDataGet has 18 parameters. Many systems have similar procedures that do multiple jobs and have long parameter lists. Usually they've been enhanced and enhanced over time and end up like this:
CREATE PROC dbo.MyDataGet @Arribute_Selection_List1 varchar(8000) = NULL ,@Attribute_Selection_List_Another_Way varchar(8000) = NULL ,@Entity_Selection_List varchar(8000) = NULL ,@Entity_Selection_List_Another_Way varchar(8000) = NULL ,@Datetime_selection_one_way varchar(8000) = NULL ,@datetime_selection_another_way varchar(8000) = NULL ,@Special_Treatment_1 bit = 0 ,@Special_Treatment_2 bit = 0 ,@Special_Treatment_3 bit = 0 .... .... more parameters AS....
The @Attribute_Selection_List% parmeters let the user pick their attributes, which end up as columns. The @Entity_Selection_List% parmeters let the users pick their entities, which end up as rows. @Datetime_selection_% parameters let the users specify the datetime values used to retrieve the data. @Special_treatment_% parameters request adjustments like Winsorization, which is a statistical technique that removes outliers. With so many parameters and such wide range of options, the run-time for calls to MyDataGet ranges from one second up to a couple of hours. I had to put limits on the amount of data that could be requested after a user filled tempdb with a 6 hour query and brought the server to a halt.
When you have two sets of data representing before and after results there are several statistical tests that can tell you if after is really better than before. My first thought was to use a T-Test. Since I have pairs of tests the Paired T-Test sounds like a good fit. However, the T-Test is based on the assumption that the data comes from a normal distribution as you might expect from test scores from a classroom of students. Take a look at the data from my comparison of the before and after versions:
There are big differences between the shortest query, number 1 and the slowest, number 8. Let's compute the averages:
CREATE TYPE dbo.test_results AD TABLE ( test_number smallint NOT NULL , before_seconds float NULL , after_seconds float NULL ) DECLARE @mydataget_results dbo.test_results; INSERT @mydataget_results (test_number, before_seconds,after_seconds) VALUES (1, 1.5, 0.9), (2,2.7,2.9), (3,25.7,22.3),(4,30.9,31.1),(5,22.5,20.1) ,(6,103.2,100.0),(7,422.2,400.1),(8,1737.0,1619.3),(9,19.9,17.3) ,(10,155.0,179.0),(11,200.0,190.0),(12,72.1,70.2),(13,115.9,110.0) ,(14,122.0,109.0),(15,450.1,462.0) SELECT AVG(before_seconds) before_avg_sec , AVG(after_seconds) after_avg_sec , COUNT(test_number) Num_tests , SUM(CASE WHEN before_seconds - after_seconds > 0 THEN 1 else 0 END) after_is_better FROM @mydataget_results
Here are the results:
before_avg_sec after_avg_sec Num_tests after_is_better ---------------- ------------- --------- --------------- 232.046666666667 222.28 15 11
If we make our determination based on averages then query 8 dominates the results and it would only be query 8 that counted towards the decision. That might be okay with you, but my users don't really care much about query 8, which is run by an auditing program. Some users care more about the queries that have users waiting at the screen for the results. Other users care most about the queries used by a production job that goes into their key business process. It's difficult to get agreement about "which of their children are the most important?". I've found that, like children, it's best to treat them equally.
The sign test looks only at the sign of the difference of the results. That is: "is the after measurement better than the before measurement". That takes away the problem of magnitudes. It also takes away the problem of knowing if the data comes from a normal distribution as the T-Test would require. It's a lot like flipping a coin. We're asking, "What the chance that in 15 flips (tests) that 11 of them come out heads (after is better)?" That's just probability.
Because it's just probability, it's not so hard to compute the results of the sign test. We could do it in T-SQL if we really wanted to. Dr. James McCaffrey has an article in the February edition of MSDN magazine a few weeks ago that shows how in C#, Test Run - The Sign Test Using C#. The article also explains a lot about the logic of the sign test. I'd trust Dr. McCaffrey's statistical code more than something that I wrote. I considered putting his C# code into a SQLCLR procedure. But again, why do the extra work and risk introducing errors. SQL Server 2016 includes Microsoft's R Server and a way to invoke R scripts from T-SQL In R the sign test is implemented by the biom function. One only need supply the number of tests and the number of the tests where the new code is faster than the old code.
The interaction between T-SQL and R is the stored procedure sp_execute_external_script. It's similar to the way that sp_executesql allows you to execute dynamically created T-SQL. Here's the script to count the number of tests and the number of successes and pass them to R:
DECLARE @mydataget_results dbo.test_results; INSERT @mydataget_results (test_number, before_seconds,after_seconds) VALUES (1, 1.5, 0.9), (2,2.7,2.9), (3,25.7,22.3),(4,30.9,31.1),(5,22.5,20.1) ,(6,103.2,100.0),(7,422.2,400.1),(8,1737.0,1619.3),(9,19.9,17.3) ,(10,155.0,179.0),(11,200.0,190.0),(12,72.1,70.2),(13,115.9,110.0) ,(14,122.0,109.0),(15,450.1,462.0) declare @number_of_tests int, @after_better int SELECT @number_of_tests = COUNT(test_number) , @after_better = SUM(CASE WHEN before_seconds - after_seconds > 0 THEN 1 else 0 END) FROM @mydataget_results EXECUTE sp_execute_external_script @language = N'R' , @script = N' pvalue <- binom.test(after_better,number_of_tests ,0.5,alternative=''greater'')$p.value; OutputDataSet <- data.frame(pvalue); ' , @params = N'@number_of_tests int, @after_better int' , @number_of_tests = @number_of_tests , @after_better = @after_better ---------------------- 0.059234619140625
The sp_execute_external_script parameters are similar to those for sp_execute_sql and start with the @language parm, which must be N'R'. Microsoft is teasing us with the prospect of future scripting languages. My top candidates are C# scripts and Python. Next comes the @script parameter, which holds the R script. The first two lines of the R script perform the binom.test function, extracts the $p.value, which has the probability, and assigns the results to the variable pvalue. The last line of the R script assigns a new data.frame with pvalue to OutputDataSet. The sp_execute_external_script returns OutputDataSet as the results of the procedure. The input to the R script is the two integer's @number_of_tests and @after_better, which are declared in the @params parameter. These are passed on the next two lines. The results of the query, 0.0592, is shown at the bottom of the script. The columns of this resultset don't get column names. The result of binom.test is a probability that we'd see this result if the before and after had the same performance.
When it comes to statistics the most difficult part is interpreting the results. Our sign tests tells us that if the hypothesis that "the before and after code have the same performance or the before code performs better than the after code" were true we'd see 11 or more "after better" results in 15 trials only 6% of the time. That's not much. The hypothesis is probably not true and we can reject it and decide that the after code is by-and-large faster than the before code. That's enough for me to go ahead with installing the code. If you have to convince skeptical users that your new code is better than the old code, statistics can be a powerful tool. Just keep a straight face when you tell them.
- Install R Services for SQL 2016. You can read about setting up R in the tip SQL R Services for SQL Server 2016.
- Learn a little R and try out sp_execute_external_script
Last Update: 3/15/2017
About the author
View all my tips