Populating a SQL Server Test Database with Random Data

By: Tibor Nagy   |   Related Tips: More > Testing
Problem

If you develop new functionality for your SQL Server database, you have probably encountered the typical problem of testing large scale databases. I have to run a series of performance and functional tests on a database with a few million rows, but I do not have the necessary test data. I have heard about some excellent commercial tools, but they are expensive and my company cannot afford them. How can I generate the test data on my own and populate the tables in the test database?

Solution

I will show you some tricks to generate and multiply rows for a test database. Each data type needs a different approach, as you can see below.

Numeric data

First of all, you will definitely need to generate some numeric data. The obvious idea would be to use the RAND() function. Unfortunately, RAND() without a seed is evaluated once per query, not once per row, so it returns the same value for every row your query touches. I suggest using other methods unless you are willing to supply a different seed value for each and every call, which can be very painful when generating a mass volume of data.
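To see the per-query behavior, and a common workaround if you do want to stay with RAND(), you can seed it from NEWID() on each row. This is a sketch (it assumes a table like the TESTTABLE used below), not something the rest of this tip relies on:

```sql
-- RAND() alone: every row gets the same value, because the function
-- is evaluated once for the whole query
SELECT RAND() AS SAME_EVERY_ROW FROM TESTTABLE

-- Seeding RAND() from a per-row NEWID() checksum forces a fresh
-- evaluation per row, yielding a float in [0, 1) for each row
SELECT RAND(CHECKSUM(NEWID())) AS DIFFERENT_EVERY_ROW FROM TESTTABLE
```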

If you are looking for a truly random value, you will get much better results by generating a NEWID() and calculating its checksum. The following expression sets the INTVALUE1 column to a random integer value:

UPDATE TESTTABLE SET INTVALUE1 = CHECKSUM(NEWID())

By modifying this expression further you can generate decimal values, and by using the ABS() function you can eliminate negative numbers:

UPDATE TESTTABLE SET DECVALUE1 = ABS(CHECKSUM(NEWID())) / 100.0
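The same expression can also be bounded with the modulo operator when you need values in a specific range. A sketch, using the same illustrative TESTTABLE column names:

```sql
-- Random integer between 1 and 100 (inclusive): the remainder
-- modulo 100 is 0..99, so add 1 to shift the range
UPDATE TESTTABLE SET INTVALUE1 = ABS(CHECKSUM(NEWID())) % 100 + 1

-- Random decimal between 0.00 and 999.99, e.g. for price-like data:
-- take the remainder modulo 100000, then shift the decimal point
UPDATE TESTTABLE SET DECVALUE1 = ABS(CHECKSUM(NEWID())) % 100000 / 100.0
```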

Fixed range data sets

We can use the above method with some changes to generate values within a fixed data range. For example, when you need to generate a Boolean value, you can choose from the data sets (0;1) or (Y;N), etc. You can use the parity of the random integer to convert it into Boolean data. Beware: you cannot sort NEWID() results into categories with a single CASE expression, since NEWID() can be re-evaluated for every branch of the CASE. The following expressions generate random Yes or No values for the column BOOLVALUE1:

--Generate random 0 or 1 value for every row
UPDATE TESTTABLE SET BOOLVALUE1 = ABS(CHECKSUM(NEWID())) % 2
--Translate the values to Yes or No
UPDATE TESTTABLE SET BOOLVALUE1 = 'N' WHERE BOOLVALUE1 = '0'
UPDATE TESTTABLE SET BOOLVALUE1 = 'Y' WHERE BOOLVALUE1 = '1'
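If you prefer to avoid the two-step translation, the parity trick can also index directly into a small character list. This variant is a sketch of the same idea and sidesteps CASE entirely:

```sql
-- Pick the 1st or 2nd character of 'NY' based on the parity of a
-- per-row random integer: remainder 0 selects 'N', remainder 1 selects 'Y'
UPDATE TESTTABLE SET BOOLVALUE1 = SUBSTRING('NY', ABS(CHECKSUM(NEWID())) % 2 + 1, 1)
```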

Text data

Text data requires special attention if you would like a database with reasonable content. You have to build an initial dictionary and then use it to breed more rows. For example, it is common to use names in various database fields. The following example shows how to create a table containing 100 different names in a few seconds.

--Create table for first names
CREATE TABLE [NAMES1] (FIRST_NAME [varchar](20))
--Create table for family names
CREATE TABLE [NAMES2] (FAMILY_NAME [varchar](20))
--Fill first names
INSERT INTO NAMES1 VALUES ('John')
INSERT INTO NAMES1 VALUES ('Jack')
INSERT INTO NAMES1 VALUES ('Jill')
INSERT INTO NAMES1 VALUES ('Bill')
INSERT INTO NAMES1 VALUES ('Mary')
INSERT INTO NAMES1 VALUES ('Kate')
INSERT INTO NAMES1 VALUES ('Kevin')
INSERT INTO NAMES1 VALUES ('Matt')
INSERT INTO NAMES1 VALUES ('Rachel')
INSERT INTO NAMES1 VALUES ('Tom')
--Fill family names
INSERT INTO NAMES2 VALUES ('Smith')
INSERT INTO NAMES2 VALUES ('Morgan')
INSERT INTO NAMES2 VALUES ('Simpson')
INSERT INTO NAMES2 VALUES ('Walker')
INSERT INTO NAMES2 VALUES ('Bauer')
INSERT INTO NAMES2 VALUES ('Taylor')
INSERT INTO NAMES2 VALUES ('Morris')
INSERT INTO NAMES2 VALUES ('Elliott')
INSERT INTO NAMES2 VALUES ('Clark')
INSERT INTO NAMES2 VALUES ('Rock')
--Generate 10x10=100 different names
SELECT * INTO TESTTABLE FROM NAMES1 CROSS JOIN NAMES2

You can build on the above examples to quickly generate more and more distinct database records from a few small sample tables. Cross joins and self joins can easily create large datasets, so be cautious when using very big tables: cross joining two tables of one thousand records each produces a result of one million records.
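To push this toward the row counts mentioned in the problem statement, the 100-name TESTTABLE can itself be cross joined with a generated number sequence. The following is a sketch: TESTTABLE_LARGE is an illustrative name, and it assumes sys.objects contains at least 100 rows (typical even in an empty database) so that the self cross join yields the 10,000 numbers requested:

```sql
-- Multiply the 100-row TESTTABLE by a 10,000-row number sequence
-- to reach 1,000,000 rows
SELECT T.FIRST_NAME, T.FAMILY_NAME, N.NUM
INTO TESTTABLE_LARGE
FROM TESTTABLE T
CROSS JOIN (
    -- sys.objects cross joined with itself is a common source of
    -- an arbitrary-length sequential number list
    SELECT TOP 10000 ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS NUM
    FROM sys.objects A CROSS JOIN sys.objects B
) N
```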

Last Update: 12/9/2010


About the author
Tibor Nagy is a technical and business professional from the financial industry. His experience includes SQL Server 2000-2008, DB2 and MySQL.


Comments and Feedback:
Thursday, December 09, 2010 - 8:29:23 AM - Nas

Good article. I like the suggestion of using joins to create mass data, a great time saver.

Just one point on the first two CREATE TABLE statements: I think a column length is needed, as the subsequent inserts fail otherwise.

--Create table for first names
CREATE TABLE [NAMES1] (FIRST_NAME [varchar](50))

--Create table for family names
CREATE TABLE [NAMES2] (FAMILY_NAME [varchar](50))

I think 50 should do the trick.

Thanks again.

 


Thursday, December 09, 2010 - 9:48:00 AM - Jerry LaGrou

Um, way too hard. Also, making your own test data means it will only have features and behaviors you understand before the testing. Use real source data.

Normally the reason for a smaller test database is so that the big scripts will run fast enough on the test machine to allow quick test cycles. So, first take a full restore of the production data. Find the big tables and delete most of the records. One of my favorite ways of doing this is to delete all the records where the ID (normally an identity int counting up) is not evenly divisible by either of two prime numbers.

DELETE FROM <your big table> WHERE ((ID/1129)*1129 != ID) AND ((ID/1151)*1151 != ID)

This leaves a consistent 0.2% of the data, with variable spacing between the remaining records. (If 0.2% is too much, pick larger primes; if it is not enough, pick smaller ones.)

If your FKs are set up well (cascading deletes), this should complete the task; if not, a few more deletes and hurray.

If obfuscating customer data is also a requirement, replace all the consonants in the sensitive data with an X and all the vowels with an A. Now you are ready for tests with realistic values on all the flags, statuses, and other anomalous artifacts in the original data set.

Happy programming!

 


Thursday, December 09, 2010 - 10:39:26 AM - Tibor Nagy

@Nas:

Thanks for the suggestion. I will correct the typing error.

@Jerry LaGrou:

I provided the tip because there are certain scenarios where SQL developers cannot use production data for testing. I have real life experience on this topic: I was working for an educational software company, but the company did not use its own software, so we did not have any real data. Another example is the financial industry, where strict security rules prohibit us from restoring production data to any test server.


Tuesday, September 13, 2011 - 4:54:08 PM - Ray

I think you could use a CASE expression for the Boolean data element. You only call NEWID() once per row if you code it this way:

case when checksum(newid()) % 2 = 0 Then 'Y' Else 'N' end

I also support your comments about not having/using production data. We provided customers with an application that was full of HIPAA related information. Many of them would not let us have a production database backup, and the rest should not have. Additionally, when you are working on a new application, then by definition you don't have access to production data.

Also, the exercise of generating "pseudo" data often forces developers to seriously analyze aspects of their data and its relationships that they otherwise would never consider.


Friday, September 14, 2012 - 1:56:44 PM - boris tyukin

Interesting post, but check out this cool open source project: http://dataroller.sourceforge.net/

I stumbled upon it a while ago and you can do quite a few things with it.



Copyright (c) 2006-2014 Edgewood Solutions, LLC All rights reserved