![]() |
|
|
By: Arshad Ali | Read Comments (12) | Print Arshad is a SQL and BI Developer focusing on Data Warehousing projects for Microsoft. Related Tips: 1 | 2 | 3 | 4 | More |
|
Solution
As mentioned above, SSIS is the successor of DTS (of SQL Server 7/2000). If you are coming from a DTS background, SSIS packages may look similar to DTS packages, but it's not the case in reality. What I mean is, SSIS is not an enhancement to DTS but rather a new product which has been written from scratch to provide high performance and parallelism and as a result of this it overcomes several limitations of DTS.
SSIS 2008 has further enhanced the internal dataflow pipeline engine to provide even better performance, you might have heard the news that SSIS 2008 has set an ETL World record of uploading 1TB of data in less than half an hour.
The best part of SSIS is that it is a component of SQL server. It comes free with the SQL Server installation and you don't need a separate license for it. Because of this, along with hardcore BI developers, database developers and database administrators are also using it to transfer and transform data.
Best Practice #1 - Pulling High Volumes of Data
Recently we had to pull data from a source table which had 300 millions records to a new target table. Initially when the SSIS package started, everything looked fine, data was being transferred as expected but gradually the performance degraded and the data transfer rate went down dramatically. During analysis we found that the target table had a primary clustered key and two non-clustered keys. Because of the high volume of data inserts into the target table these indexes got fragmented heavily up to 85%-90%. We used the online index rebuilding feature to rebuild/defrag the indexes, but again the fragmentation level was back to 90% after every 15-20 minutes during the load. This whole process of data transfer and parallel online index rebuilds took almost 12-13 hours which was much more than our expected time for data transfer.
Then we came with an approach to make the target table a heap by dropping all the indexes on the target table in the beginning, transfer the data to the heap and on data transfer completion, recreate indexes on the target table. With this approach, the whole process (by dropping indexes, transferring data and recreating indexes) took just 3-4 hours which was what we were expecting.
This whole process has been graphically shown in the below flow chart. So the recommendation is to consider dropping your target table indexes if possible before inserting data to it specially if the volume of inserts is very high.
%20-%20Best%20Practices%20-%20Part%201%20-%20img1.jpg)
Best Practice #2 - Avoid SELECT *
The Data Flow Task (DFT) of SSIS uses a buffer (a chunk of memory) oriented architecture for data transfer and transformation. When data travels from the source to the destination, the data first comes into the buffer, required transformations are done in the buffer itself and then written to the destination.
The size of the buffer is dependant on several factors, one of them is the estimated row size. The estimated row size is determined by summing the maximum size of all the columns in the row. So the more columns in a row means less number of rows in a buffer and with more buffer requirements the result is performance degradation. Hence it is recommended to select only those columns which are required at destination.
Even if you need all the columns from the source, you should use the column name specifically in the SELECT statement otherwise it takes another round for the source to gather meta-data about the columns when you are using SELECT *.
If you pull columns which are not required at destination (or for which no mapping exists) SSIS will emit warnings like this.
| [SSIS.Pipeline] Warning: The output column "SalariedFlag" (64) on output "OLE DB Source Output" (11) and component "OLE DB Source" (1) is not subsequently used in the Data Flow task. Removing this unused output column can increase Data Flow task performance. [SSIS.Pipeline] Warning: The output column "CurrentFlag" (73) on output "OLE DB Source Output" (11) and component "OLE DB Source" (1) is not subsequently used in the Data Flow task. Removing this unused output column can increase Data Flow task performance. |
Beware when you are using "Table or view" or "Table name or view name from variable" data access mode in OLEDB source. It behaves like SELECT * and pulls all the columns, use this access mode only if you need all the columns of the table or view from the source to the destination.
Tip : Try to fit as many rows into the buffer which will eventually reduce the number of buffers passing through the dataflow pipeline engine and improve performance.
Best Practice #3 - Effect of OLEDB Destination Settings
There are couple of settings with OLEDB destination which can impact the performance of data transfer as listed below.
Data Access Mode – This setting provides the 'fast load' option which internally uses a BULK INSERT statement for uploading data into the destination table instead of a simple INSERT statement (for each single row) as in the case for other options. So unless you have a reason for changing it, don't change this default value of fast load. If you select the 'fast load' option, there are also a couple of other settings which you can use as discussed below.
Keep Identity – By default this setting is unchecked which means the destination table (if it has an identity column) will create identity values on its own. If you check this setting, the dataflow engine will ensure that the source identity values are preserved and same value is inserted into the destination table.
Keep Nulls – Again by default this setting is unchecked which means default value will be inserted (if the default constraint is defined on the target column) during insert into the destination table if NULL value is coming from the source for that particular column. If you check this option then default constraint on the destination table's column will be ignored and preserved NULL of the source column will be inserted into the destination.
Table Lock – By default this setting is checked and the recommendation is to let it be checked unless the same table is being used by some other process at same time. It specifies a table lock will be acquired on the destination table instead of acquiring multiple row level locks, which could turn into lock escalation problems.
Check Constraints – Again by default this setting is checked and recommendation is to un-check it if you are sure that the incoming data is not going to violate constraints of the destination table. This setting specifies that the dataflow pipeline engine will validate the incoming data against the constraints of target table. If you un-check this option it will improve the performance of the data load.
Best Practice #4 - Effect of Rows Per Batch and Maximum Insert Commit Size Settings
Rows per batch – The default value for this setting is -1 which specifies all incoming rows will be treated as a single batch. You can change this default behavior and break all incoming rows into multiple batches. The allowed value is only positive integer which specifies the maximum number of rows in a batch.
Maximum insert commit size – The default value for this setting is '2147483647' (largest value for 4 byte integer type) which specifies all incoming rows will be committed once on successful completion. You can specify a positive value for this setting to indicate that commit will be done for those number of records. You might be wondering, changing the default value for this setting will put overhead on the dataflow engine to commit several times. Yes that is true, but at the same time it will release the pressure on the transaction log and tempdb to grow tremendously specifically during high volume data transfers.
The above two settings are very important to understand to improve the performance of tempdb and the transaction log. For example if you leave 'Max insert commit size' to its default, the transaction log and tempdb will keep on growing during the extraction process and if you are transferring a high volume of data the tempdb will soon run out of memory as a result of this your extraction will fail. So it is recommended to set these values to an optimum value based on your environment.
Note: The above recommendations have been done on the basis of experience gained working with DTS and SSIS for the last couple of years. But as noted before there are other factors which impact the performance, one of the them is infrastructure and network. So you should do thorough testing before putting these changes into your production environment.
Next Steps
| Share: | Share | Tweet |
|
![]() |
![]() |
Connect with MSSQLTips.com |
| Friday, September 18, 2009 - 7:44:36 AM - meet2zeni | Read The Tip |
|
Hi Arshad, Thanks for such a detailing on the topic. I have read all your articles on MSSQLTIPS. I am a great fan of your writing and understanding on the subject, As you describe such a complex topic with such a simplicity. I am always looking forward to read more and I recommend your articles to my friends and collegues. Thanks and Regards, |
|
| Friday, September 18, 2009 - 8:12:58 AM - DavidB | Read The Tip |
|
Thanks for the useful list of recommendations that everyone should keep in mind while building a SSIS package. I do not think SSIS is free. It is a component of SQL Server, which fails uses the same license purchased for the DBMS and Reporting services. Therefore, you are correct that a separate license is not needed for SSIS. However, if you were to setup a server with only the SSIS component installed, you would be require to have a SQL Server license for that server. I could be wrong, so that is why whenever I get confused about SQL Server licensing, I contact our Microsoft representative or Microsoft directly.
|
|
| Sunday, September 27, 2009 - 6:24:02 AM - arshad0384 | Read The Tip |
|
Hi Zainab, Thanks a lot for your encouraging words and appreciation. I am really feeling happy, you find my articles so fruitfull. Nothing could be more satisfying for an author than his article being helpful for the audience/readers. Please feel free to write me if you want to provide any feedback or want an article on any particular technologies. Thanks a lot again for your kind words. :)
|
|
| Sunday, September 27, 2009 - 6:48:24 AM - arshad0384 | Read The Tip |
|
Hi David, Yes you are right, along with SSRS and SSAS, SSIS is also a component of SQL Server. So as long as you have SQL Server Licence for a box, you can use any of these components on that particular box without needing another licence for these components. But if you want to use it on any other box, than you have license for, then in that case you will be required to have license for that new box as well. For detail and latest information about Licensing I would encourage readers to visit Microsoft site or call Microsoft representative. http://www.microsoft.com/sqlserver/2008/en/us/licensing.aspx http://www.microsoft.com/sqlserver/2008/en/us/licensing-faq.aspx#licensing Hope it helps. |
|
| Friday, October 09, 2009 - 2:54:22 PM - Jim | Read The Tip |
|
Hello, In your article, there is the statement: Beware when you are using "Table or view" or "Table name or view name from variable" data access mode in OLEDB source. It behaves like SELECT * and pulls all the columns, use this access mode only if you need all the columns of the table or view from the source to the destination. If I understand this correctly, it is saying that even if I 'uncheck' several of the 'Available External Columns' in The OLE - SRC, all of the columns will be selected when using 'Table or View' - even if they are unchecked. Am I understanding this corrrectly? If so, why would one be allowed to 'check' or 'uncheck' any of the 'Available External Columns'. This doesn't make sense to me. Thanks for you time, Jim
|
|
| Saturday, October 10, 2009 - 3:15:57 AM - arshad0384 | Read The Tip |
|
Hi Jim, Its interesting, let me clarify this. When you use, "Table or view" or "SELECT *" mode SSIS pulls all the columns data from the source to its buffer irrespective of how many columns you have checked or unchecked. So what is the benefit of unchecking columns of 'Available External Columns' in The OLE - SRC? Well with this you instruct SSIS to flow down all selected columns down the execution pipeline. See there are two things pulling data from source to the buffer, then passing it to the destination. What you have to pass down to transformation/destination can be controlled with this checking/unchecking. But what you pull from the source, is dependent on what statement you write. Thats why its recommended to use SELECT statement with the only columns required instead of using "Table or view" or "SELECT *" mode. This is what I have observed, you too can do onething, use SQL Server profiler to see what statements are fired at source in different cases. Though I will try to find some more information on this and share with you. Hope it clarifies your doubts. |
|
| Thursday, December 09, 2010 - 5:06:47 PM - Matt | Read The Tip |
|
I'm not sure if anyone will see this or not since the article is from a while back, but I was wondering what exactly the difference is between the batch size and the maximum insert commit size. If I have 5,000 records in a batch for a 1,000,000 record transfer will it commit after each batch? If it doesn't, then why specify a batch size? Thanks. Great article BTW. |
|
| Friday, December 10, 2010 - 6:27:59 AM - Arshad | Read The Tip |
|
Hope these links might be helpful for you: http://msdn.microsoft.com/en-us/library/ms188439.aspx More details you can find here : http://www.sql-server-performance.com/articles/biz/SSIS_Introduction_Part2_p1.aspx http://www.sql-server-performance.com/articles/biz/SSIS_An_Inside_View_Part_2_p1.aspx Hope it clears your doubts! :) |
|
| Wednesday, January 18, 2012 - 10:45:00 AM - Jason Yousef | Read The Tip |
|
Great Article, Thanks... |
|
| Friday, February 17, 2012 - 1:36:34 PM - laxman | Read The Tip |
| HI, in my source data in .txt formate Data having (ename,salary,depot), i want to data in destination .csv formate and added new column (eid) with incremental order. | |
| Sunday, February 19, 2012 - 12:58:26 AM - Arshad | Read The Tip |
| Hi Laxman, SSIS is very much capable of doing this kind of data movement. I am not sure if you are facing any issue. You can refer SQL Server Integration Services (SSIS) tutorial if you are new to it. http://www.mssqltips.com/sqlservertutorial/200/sql-server-integration-services-ssis/ | |
| Monday, April 16, 2012 - 11:46:01 PM - Dhananjay | Read The Tip |
|
Absoluty fantastic artical which will definatly help to upbring the SSIS performance. Thanks allot. Dhananjay |
|
|
privacy | disclaimer | copyright | advertise | about authors | contribute | feedback | giveaways | user groups Some names and products listed are the registered trademarks of their respective owners. Edgewood Solutions LLC | MSSharePointTips.com | MSSQLTips.com |