
SQL Server Integration Services SSIS Best Practices

By: Arshad Ali   |   Comments (31)   |   Related Tips: More > Integration Services Best Practices
Problem
SQL Server Integration Services (SSIS) has grown considerably from its predecessor DTS (Data Transformation Services) to become an enterprise-wide ETL (Extraction, Transformation and Loading) product in terms of usability, performance and parallelism. Apart from being an ETL product, it also provides built-in tasks to manage a SQL Server instance. Although the internal architecture of SSIS has been designed to provide a high degree of performance and parallelism, there are still best practices that can further optimize performance. In this tip series, I will discuss best practices to consider while working with SSIS, which I have learned while working with SSIS for the past couple of years.

Solution
As mentioned above, SSIS is the successor of DTS (from SQL Server 7/2000). If you are coming from a DTS background, SSIS packages may look similar to DTS packages, but that is not the case in reality. SSIS is not an enhancement to DTS but rather a new product, written from scratch to provide high performance and parallelism, and as a result it overcomes several limitations of DTS.

SSIS 2008 further enhanced the internal dataflow pipeline engine to provide even better performance; you might have heard that SSIS 2008 set an ETL world record by loading 1 TB of data in less than half an hour.

The best part of SSIS is that it is a component of SQL Server: it is included with the SQL Server installation and you don't need a separate license for it on a licensed server. Because of this, database developers and database administrators, along with hardcore BI developers, also use it to transfer and transform data.


Best Practice #1 - Pulling High Volumes of Data

Recently we had to pull data from a source table with 300 million records into a new target table. Initially when the SSIS package started, everything looked fine and data was being transferred as expected, but gradually the performance degraded and the data transfer rate dropped dramatically. During analysis we found that the target table had a clustered primary key and two non-clustered indexes. Because of the high volume of inserts into the target table, these indexes became heavily fragmented, up to 85%-90%. We used the online index rebuilding feature to rebuild/defragment the indexes, but the fragmentation level was back to 90% after every 15-20 minutes during the load. The whole process of data transfer with parallel online index rebuilds took almost 12-13 hours, which was much more than our expected time for the data transfer.
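The fragmentation level described above can be checked with a query against the sys.dm_db_index_physical_stats DMV; a minimal sketch, where dbo.TargetTable is a placeholder name for your destination table:

```sql
-- Check average fragmentation of all indexes on the (hypothetical) target table
SELECT i.name AS index_name,
       ips.avg_fragmentation_in_percent,
       ips.page_count
FROM sys.dm_db_index_physical_stats(
         DB_ID(), OBJECT_ID('dbo.TargetTable'), NULL, NULL, 'LIMITED') AS ips
JOIN sys.indexes AS i
  ON ips.object_id = i.object_id
 AND ips.index_id = i.index_id
ORDER BY ips.avg_fragmentation_in_percent DESC;
```

Running this periodically during a long load makes it easy to see how quickly the indexes re-fragment.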

Then we came up with another approach: make the target table a heap by dropping all the indexes on it at the beginning, transfer the data to the heap, and on completion of the data transfer recreate the indexes on the target table. With this approach, the whole process (dropping indexes, transferring data and recreating indexes) took just 3-4 hours, which was what we were expecting.

This whole process is shown graphically in the flow chart below. So the recommendation is to consider dropping your target table indexes, if possible, before inserting data into it, especially if the volume of inserts is very high.
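The drop/load/recreate steps can be sketched in T-SQL; this is an illustration only, assuming a hypothetical dbo.TargetTable with a clustered primary key and one non-clustered index (the actual load happens in the SSIS data flow between the steps):

```sql
-- Step 1: drop the indexes so the destination becomes a heap
DROP INDEX IX_TargetTable_Col1 ON dbo.TargetTable;
ALTER TABLE dbo.TargetTable DROP CONSTRAINT PK_TargetTable;

-- Step 2: run the SSIS Data Flow Task here to load the heap

-- Step 3: recreate the indexes once the load completes
ALTER TABLE dbo.TargetTable
    ADD CONSTRAINT PK_TargetTable PRIMARY KEY CLUSTERED (Id);
CREATE NONCLUSTERED INDEX IX_TargetTable_Col1 ON dbo.TargetTable (Col1);
```

In a package, steps 1 and 3 would typically be Execute SQL Tasks placed before and after the Data Flow Task.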


Best Practice #2 - Avoid SELECT *

The Data Flow Task (DFT) in SSIS uses a buffer-oriented architecture (a buffer is a chunk of memory) for data transfer and transformation. When data travels from the source to the destination, it first comes into the buffer, the required transformations are done in the buffer itself, and the data is then written to the destination.

The size of the buffer is dependent on several factors, one of them being the estimated row size. The estimated row size is determined by summing the maximum size of all the columns in the row. So more columns in a row means fewer rows fit in a buffer, more buffers are required, and the result is performance degradation. Hence it is recommended to select only those columns which are required at the destination.

Even if you need all the columns from the source, you should list the column names explicitly in the SELECT statement; otherwise the source makes another round trip to gather metadata about the columns when you use SELECT *.
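For example, using a hypothetical dbo.Employee source table, prefer an explicit column list in the OLE DB source's SQL command over SELECT *:

```sql
-- Avoid: pulls every column into the pipeline buffer
-- SELECT * FROM dbo.Employee;

-- Prefer: only the columns the destination actually needs
SELECT BusinessEntityID, FirstName, LastName, HireDate
FROM dbo.Employee;
```

The narrower the column list, the smaller the estimated row size and the more rows fit per buffer.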

If you pull columns which are not required at the destination (or for which no mapping exists), SSIS will emit warnings like these:

[SSIS.Pipeline] Warning: The output column "SalariedFlag" (64) on output "OLE DB Source Output" (11) and component "OLE DB Source" (1) is not subsequently used in the Data Flow task. Removing this unused output column can increase Data Flow task performance.
[SSIS.Pipeline] Warning: The output column "CurrentFlag" (73) on output "OLE DB Source Output" (11) and component "OLE DB Source" (1) is not subsequently used in the Data Flow task. Removing this unused output column can increase Data Flow task performance.

Beware when using the "Table or view" or "Table name or view name from variable" data access mode in the OLE DB source. It behaves like SELECT * and pulls all the columns; use this access mode only if you need all the columns of the table or view at the destination.

Tip: Try to fit as many rows into the buffer as possible; this reduces the number of buffers passing through the dataflow pipeline engine and improves performance.
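A rough way to estimate the row size that drives the buffer calculation is to sum the maximum column lengths for the columns you select; a sketch against the hypothetical dbo.Employee table used above:

```sql
-- Approximate the estimated row size by summing max column lengths.
-- Note: max_length is -1 for varchar(max)/nvarchar(max)/varbinary(max)
-- columns, which must be handled separately.
SELECT SUM(c.max_length) AS estimated_row_bytes
FROM sys.columns AS c
WHERE c.object_id = OBJECT_ID('dbo.Employee');
```

This figure, together with the data flow's DefaultBufferSize and DefaultBufferMaxRows properties, determines how many rows each buffer can carry.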


Best Practice #3 - Effect of OLEDB Destination Settings

There are a couple of settings on the OLE DB destination which can impact the performance of the data transfer, as listed below.

Data Access Mode - This setting provides the 'fast load' option, which internally uses a BULK INSERT statement for uploading data into the destination table, instead of a simple INSERT statement per row as with the other options. So unless you have a reason to change it, keep the default of fast load. If you select the 'fast load' option, there are also a couple of other settings you can use, as discussed below.

Keep Identity - By default this setting is unchecked, which means the destination table (if it has an identity column) will generate identity values on its own. If you check this setting, the dataflow engine will ensure that the source identity values are preserved and the same values are inserted into the destination table.

Keep Nulls - Again, by default this setting is unchecked, which means a default value will be inserted (if a default constraint is defined on the target column) when a NULL value comes from the source for that particular column. If you check this option, the default constraint on the destination table's column is ignored and the NULL from the source column is preserved and inserted into the destination.

Table Lock - By default this setting is checked, and the recommendation is to leave it checked unless the same table is being used by some other process at the same time. It specifies that a table-level lock will be acquired on the destination table instead of multiple row-level locks, which could turn into lock escalation problems.

Check Constraints - Again, by default this setting is checked, and the recommendation is to uncheck it if you are sure the incoming data will not violate constraints on the destination table. This setting specifies that the dataflow pipeline engine will validate the incoming data against the constraints of the target table; unchecking it improves the performance of the data load.
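These fast-load settings map closely to T-SQL bulk load options; a hedged illustration using BULK INSERT with a hypothetical table and file path (the option names shown are the T-SQL analogs, not the SSIS checkbox names):

```sql
-- T-SQL analog of the OLE DB destination fast-load settings above
BULK INSERT dbo.TargetTable
FROM 'C:\data\load.csv'
WITH (
    KEEPIDENTITY,        -- analog of Keep Identity
    KEEPNULLS,           -- analog of Keep Nulls
    TABLOCK,             -- analog of Table Lock
    CHECK_CONSTRAINTS    -- analog of Check Constraints
);
```

Seeing the same switches in BULK INSERT makes it easier to reason about what each checkbox does under the covers.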


Best Practice #4 - Effect of Rows Per Batch and Maximum Insert Commit Size Settings

Rows per batch - The default value for this setting is -1, which means all incoming rows are treated as a single batch. You can change this default behavior and break the incoming rows into multiple batches. Only a positive integer value is allowed, which specifies the maximum number of rows per batch.

Maximum insert commit size - The default value for this setting is 2147483647 (the largest value for a 4-byte signed integer), which means all incoming rows are committed once on successful completion. You can specify a positive value for this setting to commit after that many records. You might wonder whether changing the default puts overhead on the dataflow engine by committing several times. Yes, that is true, but at the same time it relieves the pressure on the transaction log and tempdb, which can otherwise grow tremendously, especially during high-volume data transfers.

The above two settings are very important to understand in order to control the growth of tempdb and the transaction log. For example, if you leave 'Maximum insert commit size' at its default, the transaction log and tempdb keep growing during the extraction process, and if you are transferring a high volume of data they can run out of space, causing your extraction to fail. So it is recommended to set these values to an optimum value based on your environment.
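The effect of committing in batches can be sketched in plain T-SQL. This is an illustrative pattern only (table and column names are hypothetical, and the NOT EXISTS restart check is for clarity, not speed): each iteration is its own transaction, so under the SIMPLE recovery model the log space can be reused between batches instead of accumulating for one giant commit.

```sql
-- Batched copy: commit every @batch rows so the transaction log
-- can truncate between batches instead of growing unbounded
DECLARE @batch INT = 10000;

WHILE 1 = 1
BEGIN
    INSERT INTO dbo.TargetTable (Id, Col1)
    SELECT TOP (@batch) s.Id, s.Col1
    FROM dbo.SourceTable AS s
    WHERE NOT EXISTS (SELECT 1 FROM dbo.TargetTable AS t WHERE t.Id = s.Id);

    IF @@ROWCOUNT < @batch BREAK;  -- last partial batch loaded: done
END;
```

Setting 'Maximum insert commit size' on the OLE DB destination achieves the same periodic-commit behavior inside the data flow.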

Note: The above recommendations are based on experience gained working with DTS and SSIS over the last couple of years. As noted before, there are other factors which impact performance, infrastructure and network among them, so you should do thorough testing before putting these changes into your production environment.

 

Last Update: 9/18/2009


About the author
MSSQLTips author Arshad Ali
Arshad Ali is a SQL and BI Developer focusing on Data Warehousing projects for Microsoft.

Comments and Feedback:
Friday, September 18, 2009 - 7:44:36 AM - meet2zeni Read The Tip

Hi Arshad,

Thanks for such detail on the topic. I have read all your articles on MSSQLTips. I am a great fan of your writing and understanding of the subject, as you describe such complex topics with such simplicity. I am always looking forward to reading more, and I recommend your articles to my friends and colleagues.

Thanks and Regards,
Zainab Dhuliawala


Friday, September 18, 2009 - 8:12:58 AM - DavidB Read The Tip

Thanks for the useful list of recommendations that everyone should keep in mind while building a SSIS package.

I do not think SSIS is free. It is a component of SQL Server, which means it uses the same license purchased for the DBMS and Reporting Services. Therefore, you are correct that a separate license is not needed for SSIS. However, if you were to set up a server with only the SSIS component installed, you would be required to have a SQL Server license for that server. I could be wrong, so whenever I get confused about SQL Server licensing, I contact our Microsoft representative or Microsoft directly.

 


Sunday, September 27, 2009 - 6:24:02 AM - arshad0384 Read The Tip

Hi Zainab,

Thanks a lot for your encouraging words and appreciation.

I am really happy that you find my articles so fruitful. Nothing could be more satisfying for an author than his article being helpful to the audience/readers.

Please feel free to write to me if you want to provide any feedback or want an article on any particular technology.

Thanks a lot again for your kind words. :)

 


Sunday, September 27, 2009 - 6:48:24 AM - arshad0384 Read The Tip

Hi David, 

Yes, you are right: along with SSRS and SSAS, SSIS is a component of SQL Server. So as long as you have a SQL Server license for a box, you can use any of these components on that particular box without needing another license for them. But if you want to use it on another box that you do not have a license for, then you will be required to have a license for that new box as well.

For detailed and up-to-date information about licensing, I would encourage readers to visit the Microsoft site or call a Microsoft representative.

http://www.microsoft.com/sqlserver/2008/en/us/licensing.aspx

http://download.microsoft.com/download/1/e/6/1e68f92c-f334-4517-b610-e4dee946ef91/2008%20SQL%20Licensing%20Overview%20final.docx

http://www.microsoft.com/sqlserver/2008/en/us/licensing-faq.aspx#licensing

Hope it helps.


Friday, October 09, 2009 - 2:54:22 PM - Jim Read The Tip

Hello,

In your article, there is the statement:

Beware when you are using "Table or view" or "Table name or view name from variable" data access mode in OLEDB source. It behaves like SELECT * and pulls all the columns, use this access mode only if you need all the columns of the table or view from the source to the destination.

If I understand this correctly, it is saying that even if I uncheck several of the 'Available External Columns' in the OLE DB source, all of the columns will be selected when using 'Table or View' - even the unchecked ones.

Am I understanding this correctly? If so, why would one be allowed to check or uncheck any of the 'Available External Columns'? This doesn't make sense to me.

Thanks for your time,

Jim 

  

 


Saturday, October 10, 2009 - 3:15:57 AM - arshad0384 Read The Tip

Hi Jim, 

It's an interesting question; let me clarify.

When you use "Table or view" or "SELECT *" mode, SSIS pulls all the columns' data from the source into its buffer, irrespective of how many columns you have checked or unchecked. So what is the benefit of unchecking columns under 'Available External Columns' in the OLE DB source?

With that, you instruct SSIS which columns to flow down the execution pipeline. There are two things here: pulling data from the source into the buffer, and then passing it on to the destination. What you pass down to the transformations/destination can be controlled by checking/unchecking columns, but what you pull from the source depends on the statement you write. That's why it's recommended to use a SELECT statement with only the required columns instead of "Table or view" or "SELECT *" mode.

This is what I have observed; you can also use SQL Server Profiler to see what statements are fired at the source in the different cases. I will try to find some more information on this and share it with you.

Hope it clarifies your doubts. 


Thursday, December 09, 2010 - 5:06:47 PM - Matt Read The Tip

I'm not sure if anyone will see this or not since the article is from a while back, but I was wondering what exactly the difference is between the batch size and the maximum insert commit size. If I have 5,000 records in a batch for a 1,000,000 record transfer will it commit after each batch? If it doesn't, then why specify a batch size?

Thanks. Great article BTW.


Friday, December 10, 2010 - 6:27:59 AM - Arshad Read The Tip

Hope these links might be helpful for you:

http://msdn.microsoft.com/en-us/library/ms188439.aspx

  • Rows per batch – a blank text box indicates the default value of -1; in that case all incoming rows are treated as one batch. A specified nonzero positive integer directs the pipeline engine to break the incoming rows into chunks of N (what you specify) rows, so the number of rows in a batch can be explicitly defined.
  • Maximum insert commit size – the batch size that the OLE DB destination tries to commit during fast load operations; it operates on chunks of data as they are inserted into the destination. If a value is provided for this property, the destination commits rows in batches that are the smaller of the Maximum insert commit size or the remaining rows in the buffer currently being processed.
  • More details can be found here: http://www.sql-server-performance.com/articles/biz/SSIS_Introduction_Part2_p1.aspx

http://www.sql-server-performance.com/articles/biz/SSIS_An_Inside_View_Part_2_p1.aspx

Hope it clears your doubts! :)


Wednesday, January 18, 2012 - 10:45:00 AM - Jason Yousef Read The Tip

Great article, thanks...


Friday, February 17, 2012 - 1:36:34 PM - laxman Read The Tip
Hi, my source data is in a .txt file with columns (ename, salary, depot). I want to load it to a .csv destination and add a new column (eid) with incrementing values.

Sunday, February 19, 2012 - 12:58:26 AM - Arshad Read The Tip
Hi Laxman, SSIS is very much capable of doing this kind of data movement. I am not sure if you are facing any issue. You can refer to the SQL Server Integration Services (SSIS) tutorial if you are new to it. http://www.mssqltips.com/sqlservertutorial/200/sql-server-integration-services-ssis/

Monday, April 16, 2012 - 11:46:01 PM - Dhananjay Read The Tip

Absolutely fantastic article which will definitely help improve SSIS performance. Thanks a lot. Dhananjay


Thursday, May 31, 2012 - 1:09:39 PM - Elena Read The Tip

Hi Arshad,

Thank you for your article. I have a question.

Recently I tested an SSIS package that loaded data from a SAS environment into a SQL table. The table has 52,000,000 rows. I implemented the same logic: dropped the indexes (clustered, non-clustered) and recreated them after the data was loaded into the table. But I got a note from a DBA that the creation of the indexes blocked somebody's process on the server. They recommend I disable them instead. What would be your suggestion in this situation?

Thanks,

Elena

     


Monday, July 30, 2012 - 8:31:32 AM - Mushir Read The Tip

Great article....

Monday, July 30, 2012 - 8:32:28 AM - Mushir Read The Tip

Elena, when do you run this load? The method suggested by Arshad should be used when the target table can be used exclusively by the load process. If there are other people using the system concurrently, they will certainly be affected if you drop the indexes. In such cases you have to find some other way to optimise your package. If, however, there are times when the system can be used exclusively by your package (midnights/weekends), you can use this method and schedule your package during such a window.


Monday, August 13, 2012 - 2:43:22 AM - Arshad Read The Tip

Thanks Mushir, you are absolutely right.

Elena, you can disable and rebuild only the non-clustered indexes; disabling the clustered index will make the table unavailable.

As Mushir suggested, either schedule your package at midnight or on the weekend when no one else is using the table, or consider disabling and rebuilding the non-clustered indexes along with rebuilding the clustered index (possibly online, though that has its own considerations; refer to the link below). Rebuilding indexes is required to keep the fragmentation level under control. Try out these different options and see which one suits your particular scenario.

http://www.mssqltips.com/sqlservertip/2361/rebuilding-sql-server-indexes-using-the-online-option/


Monday, January 07, 2013 - 4:48:12 AM - Martin Read The Tip

Great article Arshad, thank you!

During a transfer of >200M rows I had problems with the transaction log growing huge and running out of disk space, ultimately failing the task.

Setting the "Maximum insert commit size" on the OLE DB destination to 10,000,000 seems to have done the trick! Almost 10M rows transferred as I write this, and the size of the transaction log remains small.

Regarding the "Rows per batch" setting, I read on another forum that it should be set to the estimated number of source rows and that it is only used as a hint to the query optimizer. What is your view on this?

Martin

     


Tuesday, January 29, 2013 - 2:32:49 PM - Shelley Read The Tip

Hello. I'm using BIDS to create SSIS packages on the following version: MS Visual Studio 2008 Version 9.0.30729.4462 QFE and MS .NET Framework Version 3.5 SP1. It does NOT seem to be obeying the rules I would expect for the "Keep Nulls" option when UNCHECKED. Some of the values for one of my incoming columns are NULL. There is a NOT NULL and a default constraint defined on the target table. But my package is blowing up and reporting that it is trying to insert a NULL value into a non-nullable column. I have "Keep Nulls" unchecked, but it is still trying to insert a NULL into this non-nullable column in my target table. Any ideas? Is there a known bug?

     


Friday, February 08, 2013 - 2:03:03 PM - Kevin Murphy Read The Tip

Shelley,

I would use a Derived Column task to assign a default value.

Keep Nulls, when checked, only controls whether a NULL from the source is inserted as-is.

In my opinion, it is better to have a default value than to allow a null.

The possibility that a null (an unknown value) could match a known value is rare, but it can happen.

     


Monday, March 25, 2013 - 4:40:30 PM - Pratap Chavda Read The Tip

Arshad,

Thank you very much for the best practices articles.

I have a question. I am new to SQL Server and have been able to design some packages. The question is how to deploy them to the same server in different environments, and to a different server. I have read articles on this too but fail to understand them. Is there a simple way you can explain for me to adopt?

Thanks.

Pratap.




Thursday, April 04, 2013 - 3:05:20 AM - Ismail Mohammed Read The Tip

Hi Arshad

Thank you for your post on SSIS best practices.

Ismail Mohammed


Tuesday, August 06, 2013 - 5:14:10 AM - SQLDBA Read The Tip

Hi,

I get this error when dumping a database table to a CSV file:

[1b) Dump data into csv file [19]] Error: Data conversion failed. The data conversion for column "Title" returned status value 4 and status text "Text was truncated or one or more characters had no match in the target code page.".

Please let me know the solution.


Wednesday, August 07, 2013 - 11:19:45 AM - Arshad Ali Read The Tip

It happens when the source data cannot be accommodated in the target column because the target column is smaller in size than the source column.

Please increase the target column size to align it with the source column size.

The SSIS designer detects the changes automatically when you open the data flow task in the designer and prompts you to update the component.

If you want to do it manually, you can change the properties of the data flow task to increase the size in the package, or the easiest way is to delete the existing source and destination, drag new ones in, and redo the mappings from scratch.


Wednesday, August 07, 2013 - 1:27:45 PM - sqldba Read The Tip

The package runs from a SQL Agent job with a configuration file set. The job shows the package ran successfully, but the step shows a failure like "not able to access variables". Any ideas?


Wednesday, September 04, 2013 - 5:17:08 AM - M.Seenuvasan Read The Tip

Nice to see such useful information.


Friday, December 27, 2013 - 8:31:39 AM - ericv Read The Tip
Arshad, I think your article has been plagiarised here: http://www.codetails.com/bbc172038/increasing-the-performance-of-ssis-package-best-practices/20121107

Friday, January 17, 2014 - 5:05:44 AM - Jack Russell Read The Tip

I created an SSIS package using the SQL Server Import and Export Wizard and checked the box "Delete rows in destination table". I then created a new Integration Services package in BIDS and imported my package from the file system.

I am now editing this package in BIDS to add more tables to it, but there appears to be no facility to include the command to delete rows in the destination table.

Does the Table or View - Fast load action do this as a matter of course?


Wednesday, March 26, 2014 - 5:52:48 PM - Nick Ryan Read The Tip

Jack, how about you add an Execute SQL Task and use that to TRUNCATE your table?


Wednesday, March 26, 2014 - 5:56:34 PM - Nick Ryan Read The Tip

After applying a patch to our SQL Servers (2008 R2), the way the bulk upload table lock is applied changed. This resulted in a number of our packages ending up in a kind of deadlock situation.

This appeared to happen when inserting into a table with a clustered index while attempting multiple batches.

We found we had to remove the table lock in these cases.


Friday, July 25, 2014 - 2:32:02 PM - Sunny Read The Tip

Great article and a very simple explanation. The comments also helped clarify some of the doubts. Thanks!



    Copyright (c) 2006-2014 Edgewood Solutions, LLC All rights reserved