Data Sampling in SQL Server Integration Services


By:   |   Updated: 2016-12-20   |   Comments (2)   |   Related: More > Integration Services Development

Problem

When developing SSIS packages, sometimes it is easier to work with a sample dataset instead of the entire dataset.  In this tip we look at a couple of different ways to do data sampling transformations in SSIS.

Solution

There are two transformations in SSIS that will help us generate a smaller subset of data based on sampling. They are row sampling and percentage sampling transformations.

Usage of SSIS sampling transformation

  • Instead of using the full production dataset, a sample can be used to test the package multiple times for different conditions. This can be very helpful to replicate production bugs on a development server.
  • Based on the environment capacity (Dev, Test & UAT servers), the test data size can be adjusted easily.
  • Allows you to test data mining models with different test data.
  • Allows for data profiling activities to understand the quality of the source data.

SSIS Row Sampling Transformation

The SSIS Row Sampling transformation generates a sample dataset based on a random selection from the input dataset. The expected number of sample records must be entered in the transformation. By default the transformation generates a random number to select the input rows. So every time you run the transformation you may a get different set of records. This is very useful, as you may want to test the data flow for different sets of representative data.

However if you prefer to have the same set of data every time, then you can define a seed. This can be done by enabling the option "Use the following random seed".

This transformation generates two outputs. The "sampling selected output" will provide the sample dataset based on the number of rows keyed in and the rest of the records will be available thru the output "Sampling unselected output".

The sampling seed and the sampling value properties can also be set using the properties window.

In the below image, we are extracting data from Fact InternetSales and would like to sample 100 records. The row sampling transformation has been added as below.

SQL Server Integration Services Row Sampling Transformation

The sampling value (number of rows) has been set to 100. Also the sampling seed has been set to 1, so that we get the same records each time we this is run.

SQL Server Integration Services Row Sampling Transformation Editor

After successful execution, we can see the sampling records thru the "Sampling Selected Output".

SQL Server Integration Services Sampling Selected Output

The data viewer shows the sample output.

SQL Server Integration Services Data Viewer

SSIS Percentage Sampling

The SSIS Percentage Sampling transformation works the same way as the Row Sampling transformation except we provide the percentage of sample records instead of a specific record count.

The Percentage Sampling transformation uses an algorithm to determine whether a row should be included in the sample output. Because of the functionality of the algorithm, the number of output rows will not exactly match the specified percentage.  Let's assume we have a source dataset with 10,000 records and you want to inspect the data quality by sampling 10% of the dataset. Ideally you expect to see 1000 records, but the actual records sampled by SSIS may be greater or less than 1000 records.

In the below image, we are extracting data from Fact InternetSales and would like to sample 10% of the records. The percentage sampling transformation has been added as below.

SQL Server Integration Services Percentage Sampling

The sampling value (percentage of rows) has been set to 1.

SQL Server Integration Services Percentage Sampling Transformation Editor

After successful execution, we can see the sampling records thru the "Sampling Selected Output". 

SQL Server Integration Services Sampling Selected Output

Summary

The Sampling transformations has a wide variety of usage. In data mining analysis, both the selected and unselected outputs can be used to design and test the model. Both the row and percentage sampling transformations are easy to learn and configure.

Next Steps


Last Updated: 2016-12-20


get scripts

next tip button



About the author
MSSQLTips author Nat Sundar Nat Sundar is working as an independent SQL BI consultant in the UK with a Bachelors Degree in Engineering.

View all my tips




Post a comment or let the author know this tip helped.

All comments are reviewed, so stay on subject or we may delete your comment. Note: your email address is not published. Required fields are marked with an asterisk (*).

*Name
*Email
Email me updates

Signup for our newsletter

I agree by submitting my data to receive communications, account updates and/or special offers about SQL Server from MSSQLTips and/or its Sponsors. I have read the privacy statement and understand I may unsubscribe at any time.





Thursday, December 22, 2016 - 8:03:03 AM - Nat Sundar Back To Top

 

 Thanks Javid for the nice comments


Wednesday, December 21, 2016 - 4:28:51 PM - javid khan Back To Top

 Thnaks Nat for small but very helpful artical.

 Keep good work of sharing.

 

 



download

























get free sql tips

I agree by submitting my data to receive communications, account updates and/or special offers about SQL Server from MSSQLTips and/or its Sponsors. I have read the privacy statement and understand I may unsubscribe at any time.



Learn more about SQL Server tools