Generate Random Strings with High Performance with a SQL CLR function


By: Jeffrey Yao   |   Last Updated: 2015-04-17

Problem

In my work, random strings are useful in many ways. For example, I want to replace all sensitive information of some columns with random strings after I restore the production SQL Server database to a test environment, or I want to generate dummy data for development purposes. So is there a way I can generate random strings easily?

I have the following requirements for the random string generation:

  • I can define the string length to be within a range.
  • I can repeatedly generate the exact same strings if needed, so I can make sure my data quantity and quality are the same.
  • I can generate random strings with simple patterns. For example, a postal code in Canada has the format A1A 1A1, i.e. LetterNumberLetter NumberLetterNumber, such as V3V 2A4 or M9B 0B5.

Solution

There are many ways in T-SQL to generate random strings. Here is one good discussion of this topic "Generating random strings with T-SQL".

Generally speaking, with pure T-SQL, we can use Rand(), NewID(), CRYPT_GEN_RANDOM() and Convert/Cast/Substring T-SQL to create random strings.
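As a rough illustration of the pure T-SQL building blocks named above, the following sketch shows one way each might be used. The column aliases are made up for this example, and this is not the code from the linked discussion.

```sql
-- 32 hexadecimal characters taken from a new GUID
select GuidString = replace(convert(varchar(36), newid()), '-', '');

-- 20 random bytes rendered as 40 hexadecimal characters
-- (convert style 2 omits the 0x prefix)
select HexString = convert(varchar(40), crypt_gen_random(20), 2);

-- one upper-case letter; rand() with a seed returns the same value
-- every time it is called with that seed
select SeededLetter = char(65 + cast(rand(42) * 26 as int));
```

Longer strings need several of these calls glued together with concatenation or Substring, which is where the pure T-SQL approach starts to get awkward.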

However, just using pure T-SQL has two obvious disadvantages:

  1. Non-deterministic functions such as Rand(), NewID() and CRYPT_GEN_RANDOM() are not allowed to be used inside a UDF, which means you need to create some additional layer to bypass this limitation.
  2. For heavy-load string generation, the pure T-SQL solution's performance is compromised.
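The "additional layer" in the first point is commonly a view that wraps the non-deterministic call. The sketch below assumes that workaround; dbo.vw_new_id and dbo.ufn_random_letter are hypothetical names used only for illustration.

```sql
-- hypothetical helper view: exposes newid() so a function can read it indirectly
create view dbo.vw_new_id
as
select new_id = newid();
go

-- hypothetical scalar UDF: returns one random upper-case letter
create function dbo.ufn_random_letter ()
returns char(1)
as
begin
    -- checksum() % 26 is between -25 and 25, so abs() cannot overflow
    return (select char(65 + abs(checksum(new_id) % 26)) from dbo.vw_new_id);
end
go

select RandomLetter = dbo.ufn_random_letter();
```

Every function built this way depends on the extra view, which is the kind of indirection the CLR approach avoids.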

For string manipulation inside SQL Server, a CLR function is generally considered the better fit. So in this tip, I will provide two CLR functions to meet the above-mentioned requirements.

  1. Generate a random string whose length falls within a specified range. A seed parameter ensures repeatability: the same seed returns the same random string.
  2. Generate a random string with a simple pattern defined by the pattern parameter. This function also has a seed parameter to ensure repeatable string generation.

I will not repeat the steps about how to create/deploy an assembly with Visual Studio, but you can refer to the links in [Next Steps] section to find the details.

using System;
using System.Data;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;
using System.Security.Cryptography;
using System.Text;

public partial class UserDefinedFunctions
{
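    [Microsoft.SqlServer.Server.SqlFunction]
    public static SqlString fn_random_string(SqlInt32 minLen, SqlInt32 maxLen, SqlInt32 seed)
    {
        // NULL lengths cannot be cast to int, so treat them like an invalid range
        if (minLen.IsNull || maxLen.IsNull)
        { return new SqlString(string.Empty); }

        int min_i = (int)minLen;
        int max_i = (int)maxLen;

        if (min_i <= 0 || min_i > max_i)
        { return new SqlString(string.Empty); }

        // a seed of 0 (or NULL) means "not repeatable"
        int sd = seed.IsNull ? 0 : (int)seed;
        Random r = (sd != 0) ? new Random(sd) : new Random();

        int i = r.Next(min_i, max_i + 1);
        byte[] rnd = new byte[i];
        if (sd != 0)
        {
            // seeded: take the bytes from the seeded generator as well, so the
            // same seed returns the same string and not just the same length
            r.NextBytes(rnd);
        }
        else
        {
            using (var rng = new RNGCryptoServiceProvider())
            {
                rng.GetNonZeroBytes(rnd);
            }
        }

        // Base64 yields at least i characters for i bytes; keep the first i
        string rs = Convert.ToBase64String(rnd);
        return new SqlString(rs.Substring(0, i));
    } //fn_random_string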


    [Microsoft.SqlServer.Server.SqlFunction]
    public static SqlString fn_random_pattern(SqlString pat, SqlInt32 seed)
    {
        // a NULL pattern is treated like an empty one
        string pattern = pat.IsNull ? string.Empty : pat.Value;
        if (pattern == string.Empty)
        { return new SqlString(string.Empty); }
        else
        {
            string CharList = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
            string NumList = "0123456789";
            char[] cl_a = CharList.ToCharArray();
            char[] nl_a = NumList.ToCharArray();
            // a NULL seed is treated like 0, i.e. not repeatable
            int sd = seed.IsNull ? 0 : (int)seed;
            Random rnd = new Random();
            if (sd != 0)
            {
                rnd = new Random(sd);
            }

            StringBuilder sb = new StringBuilder(pattern.Length);

            char[] a = pattern.ToCharArray();
            for (int i = 0; i < a.Length; i++)
            {
                switch (a[i])
                {
                    case '@':
                        sb.Append(cl_a[rnd.Next(0, CharList.Length)]);
                        break;
                    case '!':
                        sb.Append(nl_a[rnd.Next(0, NumList.Length)]);
                        break;
                    default:
                        sb.Append(a[i]);
                        break;
                }
            }//for
            return new SqlString(sb.ToString());
        }//else
    } // fn_random_pattern
} //UserDefinedFunctions

In my case, I build the project to generate a DLL file and put it in the C:\MSSQLTips\Random_String\bin\ folder. I then run the following to import the DLL into SQL Server 2012.

use MSSQLTips -- this is my test database
alter database MSSQLTips set trustworthy on;
exec sp_configure 'clr enabled', 1;
reconfigure with override
go

create assembly clr_random_string
from 'C:\mssqltips\Random_String\bin\CLR_Rand_String.dll'
with permission_set = safe
go

create function dbo.ucf_random_string (@minLen int, @maxLen int, @seed int =0)
returns nvarchar(max) with execute as caller
as 
external name [clr_random_string].[UserDefinedFunctions].fn_random_string;
go

/*
@pattern: @ means one letter from a to z (both lower and upper case), ! means one digit, i.e. 0 to 9. Anything else is left unchanged.
So if we have @pattern='abc !! def', then we may get strings like 'abc 12 def' or 'abc 87 def'.
For a Canadian postal code the pattern can be '@!@ !@!' (the blank in the middle is kept as it is in the generated string, e.g. 'V1A 2P5').
*/
create function dbo.ucf_random_pattern (@pattern nvarchar(max), @seed int=0 )
returns nvarchar(max) with execute as caller
as 
external name [clr_random_string].[UserDefinedFunctions].fn_random_pattern;
go

We can use the following code to generate some random strings:

Use MSSQLTips
-- generate a single random string
select RandStr=dbo.ucf_random_string(10, 30, default)
, RandPattern=dbo.ucf_random_pattern('What is the time, Mr. @@@@@@? It is ! am', default);

-- generate random Canada Post Code / US zip code
select top 10 Canada_PostCode= upper(dbo.ucf_random_pattern('@!@ !@!', row_number() over (order by column_id)))
, US_ZipCode= dbo.ucf_random_pattern('!!!!!-!!!!', ceiling(rand(column_id)) + row_number() over (order by column_id))
from sys.all_columns

Here is the result:

[Figure: query result]
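To connect this back to the masking scenario in the Problem section, here is a minimal sketch of how the pattern function might be used to overwrite a sensitive column. The table dbo.Customer and its columns CustomerID and PostalCode are hypothetical. Using the key as the seed is just one possible choice that keeps the masked values repeatable across restores; note that a seed of 0 falls back to an unseeded generator.

```sql
use MSSQLTips;

-- dbo.Customer, CustomerID and PostalCode are hypothetical names
update c
set PostalCode = upper(dbo.ucf_random_pattern('@!@ !@!', c.CustomerID))
from dbo.Customer as c;

-- same pattern and same non-zero seed: the two columns should match
select FirstCall  = dbo.ucf_random_pattern('@!@ !@!', 42)
     , SecondCall = dbo.ucf_random_pattern('@!@ !@!', 42);
```

The second query is a quick way to check the repeatability requirement after deploying the assembly.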

Performance Comparison between CLR and T-SQL

Here I compare the execution time of usp_generateIdentifier, as seen on stackoverflow.com, with my CLR version, dbo.ucf_random_string. I run each 20,000 times and record the elapsed time after every 2,000 executions. The test code uses the same parameters for both the T-SQL stored procedure and the CLR function.

Here is the test code:

-- test performance between CLR and T-SQL
Use MSSQLTips;
set nocount on;
declare @i int=1, @start_time datetime = getdate(); 
declare @str varchar(8000), @seed int;
declare @t_tsql table (run_count int, duration_ms int); -- for tsql execution stats
declare @t_clr table (run_count int, duration_ms int); -- for clr execution stats
-- run tsql solution 20,000 times
while @i <= 20000
begin
set @seed = @i;
exec dbo.usp_generateIdentifier
@minLen = 2000
, @maxLen = 4000
, @seed = @seed output
, @string = @str output;

if (@i % 2000 = 0)
  insert into @t_tsql (run_count, duration_ms)
  select @i, datediff(ms, @start_time, getdate());
set @i = @i+1;
end

select @i = 1, @start_time = getdate(); -- reinitialize variable
-- run clr solution 20,000 times
while @i <= 20000
begin
set @seed = @i;
select @str = dbo.ucf_random_string(2000, 4000, @seed)
if (@i % 2000 = 0)
  insert into @t_clr (run_count, duration_ms)
  select @i, datediff(ms, @start_time, getdate());
set @i = @i+1;
end

select t1.run_count, tsql_duration=t1.duration_ms, clr_duration=t2.duration_ms
from @t_tsql t1
inner join @t_clr t2
on t1.run_count = t2.run_count;

I put the results into an Excel sheet and graphed the data as shown below:

[Figure: performance comparison chart, T-SQL duration vs. CLR duration by run count]
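The test above calls the function once per loop iteration. Because a CLR scalar function can also be called in a set-based query, a single-statement timing run is another way to measure it. The sketch below is an assumption-laden variant, not part of the original test: it uses sys.all_columns cross joined to itself purely as a convenient row source, and the alias names are made up.

```sql
use MSSQLTips;
set nocount on;
declare @start_time datetime = getdate();
declare @str nvarchar(max);

-- call the CLR function 20,000 times in one statement, seeding with the row number
select @str = dbo.ucf_random_string(2000, 4000, n.seed)
from (
    select top (20000) seed = cast(row_number() over (order by (select null)) as int)
    from sys.all_columns a cross join sys.all_columns b
) as n;

select clr_set_based_ms = datediff(ms, @start_time, getdate());
```

Assigning to a variable discards the generated strings, so the measurement is not dominated by returning rows to the client.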

Next Steps

I like this CLR approach especially because the assembly's permission set is SAFE, meaning I do not have to worry that future .NET patches will break this CLR code.
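If you want to confirm how the assembly was registered, the catalog views can be queried. This is a small sketch that assumes the assembly name clr_random_string used earlier in this tip.

```sql
use MSSQLTips;

-- the permission set should be reported as SAFE for this assembly
select name, permission_set_desc
from sys.assemblies
where name = N'clr_random_string';

-- value_in_use = 1 means CLR integration is enabled on the instance
select name, value_in_use
from sys.configurations
where name = N'clr enabled';
```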

Please read the following articles to know more about how to work with CLR functions.

This tip's code has been tested in Visual Studio 2013 and SQL Server 2012 environment. It should be applicable to SQL Server 2008 and above as well.




About the author
Jeffrey Yao is a senior SQL Server consultant, striving to automate DBA work as much as possible to have more time for family, life and more automation.





Saturday, April 18, 2015 - 2:20:08 AM - Jeff Moden

Oh... I almost forgot.  SQL Server isn't so good at concatenation even using XML to do it with.  The problem of generating random strings with 1000 to 2000 characters is one of those places where it's really bad compared to your SQL CLR method.  Still, it's not as bad as using nested WHILE loops like I've seen other folks do.

And, yes.  This could also be turned into a function but I wanted to prove that you don't actually need a special function to do this.

 
SET STATISTICS TIME ON
;
DECLARE @pMinLength INT = 1000
       ,@pMaxLength INT = 2000
;
--===== Generate 20,000 random length strings according to the parameters above.
 SELECT RandomString =
            ((
             SELECT CHAR(ABS(CHECKSUM(NEWID()))%95+32)
               FROM dbo.fnTally(0,ABS(CHECKSUM(NEWID()))%(@pMaxLength-@pMinLength)+@pMinLength)t2
              WHERE t1.N > 0
                FOR XML PATH(''), TYPE
            ).value('(./text())[1]','varchar(max)'))
   FROM dbo.fnTally(1,20000)t1
;
SET STATISTICS TIME OFF
;


Here are the run results.  Like I said, compared to your good SQL CLR, it's really slow.

SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 0 ms.
(20000 row(s) affected)
 SQL Server Execution Times:
   CPU time = 18938 ms,  elapsed time = 19123 ms.


 


Saturday, April 18, 2015 - 1:27:33 AM - Jeff Moden

@Jeffrey Yao,

Thank you for the fine feedback and your testing.  And, you're absolutely correct.  "Better" is definitely in the eyes of the beholder.  Your fine function has the advantage of being very flexible where the code that I wrote is rather rigid in solving the given problem.  It's usually true that dedicated code is faster than more flexible code but the more flexible code, provided that it performs as well as yours does, can allow for faster development by people that may not have the knowledge or the time to write the more dedicated code.

Shifting gears a bit, my real purpose in posting was twofold.  The first was to simply demonstrate that the T-SQL code need not be as slow as the code from StackOverflow was.

The other purpose was to provide an alternative for the good folks that may need to get a similar job done but can't use SQL CLR code because of a DBA's insistence or some company rule (the U.S. Government is infamous for it) that disallows the use of SQL CLR.  It happens much more often than many would think.

Thank you again for the article and for the testing.


Friday, April 17, 2015 - 3:50:16 PM - jeff_yao

OK, I have run the code several times (in my local laptop SQL Server 2012 environment), and each time I ran dbcc freeproccache() and dbcc dropcleanbuffers() before running the real code.

Running @Jeff Moden's code, I get the following:

(20000 row(s) affected)

 SQL Server Execution Times:

   CPU time = 47 ms,  elapsed time = 178 ms.

While running

 SET STATISTICS TIME ON;

 select Canada_Post=dbo.ucf_Random_pattern('@!@ !@!', N), US_Post=dbo.ucf_random_pattern('!!!!!-!!!!', N)

 from dbo.fnTally(1, 20000)

The result is:

(20000 row(s) affected)

 SQL Server Execution Times:

   CPU time = 484 ms,  elapsed time = 493 ms.

As you can see, @Jeff Moden's T-SQL code is much faster.

The reason is clear, though: the CLR function for patterned random strings parses the pattern and loops over it character by character, and that algorithm could be optimized, at least to remove the loop.

@Jeff Moden, on the other hand, if I want to create a random string with a length between X and Y (such as 1000 to 2000), do you have a better way using pure T-SQL?

For example, if I want to create 20000 random strings with length between 1000 and 2000, I can run the following code

-- dbcc freeproccache()

 --dbcc dropcleanbuffers()

 

 SET STATISTICS TIME ON;

 select dbo.ucf_random_string(1000, 2000, N)

 from dbo.fnTally(1, 20000)

 SET STATISTICS TIME OFF;

I will get the following result.

(20000 row(s) affected)

 SQL Server Execution Times:

   CPU time = 858 ms,  elapsed time = 1135 ms.

 

Thanks @Jeff Moden for your involvement in this interesting discussion, I believe this type of discussion may benefit lots of people.


Friday, April 17, 2015 - 2:14:05 PM - jeff_yao

Thanks @Jeff Moden, I will run your code later, compare it with my CLR code, and post an update here.


Friday, April 17, 2015 - 10:40:23 AM - Jeff Moden

Hi Jeffrey,

Good article.  Thank you for taking the time to put it together.

I will, however, suggest that the race between the CLR method and the T-SQL method was "fixed" in this case because of the extreme RBAR built into the code that you got from StackOverflow caused by the gross misunderstanding of that author on how to quickly generate random characters in T-SQL.

To demonstrate, you'll need to create the following T-SQL function.  Most of the code is documentation (I hope the formatting comes out nicely on this forum.).

 

CREATE FUNCTION [dbo].[fnTally]
/**********************************************************************************************************************
 Purpose:
 Return a column of BIGINTs from @ZeroOrOne up to and including @MaxN with a max value of 1 Billion.

 As a performance note, it takes about 00:02:10 (hh:mm:ss) to generate 1 Billion numbers to a throw-away variable.

 Usage:
--===== Syntax example (Returns BIGINT)
 SELECT t.N
   FROM dbo.fnTally(@ZeroOrOne,@MaxN) t
;

 Notes:
 1. Based on Itzik Ben-Gan's cascading CTE (cCTE) method for creating a "readless" Tally Table source of BIGINTs.
    Refer to the following URLs for how it works and introduction for how it replaces certain loops.
    http://www.sqlservercentral.com/articles/T-SQL/62867/
    http://sqlmag.com/sql-server/virtual-auxiliary-table-numbers
 2. To start a sequence at 0, @ZeroOrOne must be 0 or NULL. Any other value that's convertible to the BIT data-type
    will cause the sequence to start at 1.
 3. If @ZeroOrOne = 1 and @MaxN = 0, no rows will be returned.
 5. If @MaxN is negative or NULL, a "TOP" error will be returned.
 6. @MaxN must be a positive number from >= the value of @ZeroOrOne up to and including 1 Billion. If a larger
    number is used, the function will silently truncate after 1 Billion. If you actually need a sequence with
    that many values, you should consider using a different tool. ;-)
 7. There will be a substantial reduction in performance if "N" is sorted in descending order.  If a descending
    sort is required, use code similar to the following. Performance will decrease by about 27% but it's still
    very fast especially compared with just doing a simple descending sort on "N", which is about 20 times slower.
    If @ZeroOrOne is a 0, in this case, remove the "+1" from the code.

    DECLARE @MaxN BIGINT;
     SELECT @MaxN = 1000;
     SELECT DescendingN = @MaxN-N+1
       FROM dbo.fnTally(1,@MaxN);

 8. There is no performance penalty for sorting "N" in ascending order because the output is explicitly sorted by
    ROW_NUMBER() OVER (ORDER BY (SELECT NULL))

 Revision History:
 Rev 00 - Unknown     - Jeff Moden
        - Initial creation with error handling for @MaxN.
 Rev 01 - 09 Feb 2013 - Jeff Moden
        - Modified to start at 0 or 1.
 Rev 02 - 16 May 2013 - Jeff Moden
        - Removed error handling for @MaxN because of exceptional cases.
**********************************************************************************************************************/
        (@ZeroOrOne BIT, @MaxN INT)
RETURNS
TABLE WITH SCHEMABINDING AS
 RETURN WITH
  E1(N) AS (SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
            SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
            SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
            SELECT 1)                                  --10E1 or 10 rows
, E3(N) AS (SELECT 1 FROM E1 a, E1 b, E1 c)            --10E3 or 1 Thousand rows
, E9(N) AS (SELECT 1 FROM E3 a, E3 b, E3 c)            --10E9 or 1 Billion rows                
            SELECT N = 0 WHERE ISNULL(@ZeroOrOne,0)= 0 --Conditionally start at 0.
             UNION ALL
            SELECT TOP(@MaxN) N = ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E9 -- Values from 1 to @MaxN
;

Then, to mimic your longest test of 20,000 rows, please run the following.

 

SET STATISTICS TIME ON;
 SELECT TOP 20000
          Canada_PostalCode =
             CHAR(ABS(CHECKSUM(NEWID()))%26+65)
            +CHAR(ABS(CHECKSUM(NEWID()))%10+48)
            +CHAR(ABS(CHECKSUM(NEWID()))%26+65)
            +' '
            +CHAR(ABS(CHECKSUM(NEWID()))%10+48)
            +CHAR(ABS(CHECKSUM(NEWID()))%26+65)
            +CHAR(ABS(CHECKSUM(NEWID()))%10+48)
        ,US_ZipCode =
            RIGHT(ABS(CHECKSUM(NEWID()))%99999+100001,5)
            +'-'
            +RIGHT(ABS(CHECKSUM(NEWID()))%9999+10001,4)
   FROM dbo.fnTally(1,20000)
;

I will agree that it does take a bit more thought but both of these columns could easily be generated by a high performance iTVF.  It would be worth it because then you don't have to worry about CLR code and it appears to be faster to boot.

 

