Learn more about SQL Server tools



solving sql server problems for millions of dbas and developers since 2006
join MSSQLTips for free SQL Server tips














































Retrieving random data from SQL Server with TABLESAMPLE

MSSQLTips author Greg Robidoux By:   |   Read Comments (2)   |   Related Tips: More > Functions - System

Problem
In a previous tip we talked about how to randomly retrieve records from the database.  This tip took a look at both the RAND() function and the NEWID() function.  Both of these approaches are helpful, but one problem with both of these is that depending on how you use them you may end up reading through your entire table to get a random distribution of data.  The good part about this is that you are pretty much guaranteed that you will get a totally random set of records each time.  Because of the potential overhead with reading through the entire table are there any other approaches to randomly retrieving data?

Solution
In SQL Server 2005 a new option has been added to the FROM clause.  This new option is the TABLESAMPLE feature. With the TAMPLESAMPLE option you are able to get a sample set of data from your table without having to read through the entire table or having to assign temporary random values to each row of data.  At first glance this sounds great, but there are a few things to keep in mind when using this new option.

The way this works is that data is read at a page level.  Each 8K page for the table is given a random value and based on this random value and the value you specify either number of rows or percentage will determine how many rows are returned.  So based on this, each run can retrieve a totally different number of rows of data.

How to use it

To use TABLESAMPLE, this clause is put after the FROM clause as follows:

  • ...FROM tableName TABLESAMPLE (10 PERCENT)
  • ...FROM tableName TABLESAMPLE (1000 ROWS)

You would think that the 1000 ROWS option would return 1000 rows, but what happens is that this number is converted to a percent prior to execution based on the number you specified and the approximate number of rows in the table.  So once again you are not guaranteed an exact number of rows.

Here is a sample query against the AdventureWorks database.

SELECT * FROM Sales.SalesOrderDetail TABLESAMPLE (1000 ROWS)

Here are a few sample runs using the above statement:

Run # # of Rows
1 1288
2 658
3 806
4 1232
5 1064

As you can see none of these executions returned 1000 rows.  The total number of rows is pretty random.  To help make this return an exact number of rows you can use the TOP command as well such as:

SELECT TOP 250 * FROM Sales.SalesOrderDetail TABLESAMPLE (1000 ROWS)

By using the TOP command with a smaller number than the sample rows we are pretty much guaranteed to get the number of rows we are expecting from the query.

Another thing to keep in mind is that if you make the ROWS value too small there is a chance that you can get no data returned from the query.  In my tests when I set this value to (100 ROWS) some of the executions returned no data at all.  So keep this in mind.

Another thing to be aware of is that the data is pulled page by page, not row by row.  So based on this your sampling dataset will be as diverse as the data is that is stored on each data page. So based on our example. the clustered index is on the SalesOrderID.  This value is not all that random across each page of data and therefore we have groups of data that are random versus a complete random set of data at a row level.

Here is another query to further illustrate this.

SELECT TOP 10 * FROM Sales.SalesOrderDetail TABLESAMPLE (1000 ROWS)

As you can see from the five executions below the SalesOrderID is the same for all of the executions except for run #4.  So since the data is stored based on the SalesOrderID a sample of only 10 records will not be all that random.

SalesOrderID returned

Run #1 Run #2 Run #3 Run #4 Run #5
47967
47967
47967
47967
47967
47967
47967
47967
47967
47967
50208
50208
50208
50208
50208
50208
50208
50208
50208
50208
43850
43850
43850
43850
43850
43850
43850
43850
43850
43850
44311
44311
44312
44312
44312
44312
44313
44313
44313
44313
44127
44127
44127
44127
44127
44127
44127
44127
44127
44127

In addition to randomly retrieving data you can all use the REPEATABLE option so that the query returns the same random set of data each time you run the query.  Again this assumes that your data has not changed.

SELECT TOP 10 * FROM Sales.SalesOrderDetail TABLESAMPLE (1000 ROWS) REPEATABLE (25)

When the above query is run the sample set of data that is returned will be the same each time the query is executed.

Next Steps

  • Here is another new SQL Server 2005 option that you can take advantage of.  See if this is a better and faster approach to randomly retrieving data from your tables.
  • Take a look at this other tip that discusses randomly pulling data:SQL Server Randomly Retrieve Records
  • Read more about the TABLESAMPLE option


Last Update: 8/14/2007


About the author
MSSQLTips author Greg Robidoux
Greg Robidoux is the President of Edgewood Solutions and a co-founder of MSSQLTips.com.

View all my tips


print tip Print  
Become a paid author




Recommended For You








Learn more about SQL Server tools
Comments and Feedback:
Thursday, April 26, 2012 - 1:54:17 AM - Amin Read The Tip

i want to retrive the 10% data from the table what is the query for that?


Friday, June 08, 2012 - 4:53:18 AM - coolansh63 Read The Tip

i found this article useful:

how-to-choose-random-record-from-sql-server-database/



Post a Comment or Question

Keep it clean and stay on the subject or we may delete your comment.
Your email address is not published. Required fields are marked with an asterisk (*)

*Name   *Email Notify for updates

Signup for our newsletter


Comments
*Enter Code refresh code


 
Sponsor Information







Copyright (c) 2006-2014 Edgewood Solutions, LLC All rights reserved
privacy | disclaimer | copyright | advertise | about
authors | contribute | feedback | giveaways | user groups | community | events | first timer?
Some names and products listed are the registered trademarks of their respective owners.