Problem In the previous tips (SQL Server Integration Services (SSIS) - Best Practices - Part 1, Part 2 and Part 3) of this series I briefly talked about SSIS and few of the best practices to consider while designing SSIS packages. Continuing on this path I am going to discuss some more best practices of SSIS package design, how you can use lookup transformations and what considerations you need to take, the impact of implicit type cast in SSIS, changes in SSIS 2008 internal system tables and stored procedures and finally some general guidelines.
Solution In this tip my recommendations are around
lookup transformation and different considerations which you need to take while using it,
changes in SSIS 2008 system tables and stored procedures,
impact of implicit typecast and
finally some general guidelines at the end.
As you can see this tip starts at best practice #15. See these other tips (Part 1, Part 2 and Part3) for best practices 1-14.
Best Practice #15 - Did you know, SSIS is case-sensitive?
In one of my projects, once we added one new column in a source table and wanted it to be transferred to a destination table as well. We did the required changes in our SSIS package to pull the data for this additional column. But when we started testing we noticed our SSIS package was failing with the following error.
[OLE DB Destination ] Warning: The external columns for component "OLE DB Destination" (16) are out of synchronization with the data source columns. The column "EmployeeId" needs to be added to the external columns. The external column "EmployeeID" (82) needs to be removed from the external columns. [SSIS.Pipeline] Error: "component "OLE DB Destination" (16)" failed validation and returned validation status "VS_NEEDSNEWMETADATA".
After spending several frustrating hours investigating the problem, we noticed that even though our SQL Server is case insensitive, the SSIS package is case sensitive. The reason for the above failure was that we altered the table for one column from EmployeeId to EmployeeID and since the SSIS package has stored source and destination column mappings with the old definition, it started failing because of this column name case change. So whenever you get this kind of error, match your source/destination columns case with the mapping stored in the SSIS package by going to the mapping page of OLEDB destination adaptor of the Data Flow Task.
Best Practice #16 - Lookup transformation consideration
In the data warehousing world, it's a frequent requirement to have records from a source by matching them with a lookup table. To perform this kind of transformation, SSIS has provides a built-in Lookup transformation.
Lookup transformation has been designed to perform optimally; for example by default it uses Full Caching mode, in which all reference dataset records are brought into memory in the beginning (pre-execute phase of the package) and kept for reference. This way it ensures the lookup operation performs faster and at the same time it reduces the load on the reference data table as it does not have to fetch each individual record one by one when required.
Though it sounds great there are some gotchas. First you need to have enough physical memory for storage of the complete reference dataset, if it runs out of memory it does not swap the data to the file system and therefore it fails the data flow task. This mode is recommended if you have enough memory to hold reference dataset and. your referenced data does not change frequently, in other words, changes at reference table will not be reflected once data is fetched into memory.
If you do not have enough memory or the data does change frequently you can either use Partial caching mode or No Caching mode.
In Partial Caching mode, whenever a record is required it is pulled from the reference table and kept in memory, with it you can also specify the maximum amount of memory to be used for caching and if it crosses that limit it removes the least used records from memory to make room for new records. This mode is recommended when you have memory constraints and your reference data does not change frequently.
No Caching mode performs slower as every time it needs a record it pulls from the reference table and no caching is done except the last row. It is recommended if you have a large reference dataset and you don't have enough memory to hold it and also if your reference data is changing frequently and you want the latest data.
More details about how the Lookup transformation works can be found here whereas lookup enhancements in SSIS 2008 can be found here.
To summarize the recommendations for lookup transformation:
Choose the caching mode wisely after analyzing your environment and after doing thorough testing
If you are using Partial Caching or No Caching mode, ensure you have an index on the reference table for better performance.
Instead of directly specifying a reference table in he lookup configuration, you should use a SELECT statement with only the required columns.
You should use a WHERE clause to filter out all the rows which are not required for the lookup.
In SSIS 2008, you can save your cache to be shared by different lookup transformations, data flow tasks and packages, utilize this feature wherever applicable.
Best Practice #17 - Names of system tables and procedures have changed between SSIS 2005 and SSIS 2008
SSIS gives you different location choices for storing your SSIS packages, for example you can store at file system, SQL server etc. When you store a package on SQL Server, it is stored in the system tables in msdb database. You can write your own code to upload/download packages from these system tables or use un-document system stored procedures for these tasks.
Now the twist in the story is, since SSIS 2005 has grown up from DTS, the system tables and system stored procedures use a naming convention like "dts" in its name as you can see in first column of table below. With SSIS 2008, the SSIS team has standardize the naming convention and uses "ssis" in its name as you can see in the second column of the table below. So if you are using these system tables or system stored procedure in your code and upgrading to SSIS 2008, your code will break unless you change your code to accommodate this new naming convention.
System Tables: sysdtspackagefolders90 sysdtspackages90
When you use Flat File Connection Manager, it treats all the columns as string [DT_STR] data type. You should convert all the numeric data to appropriate data type or else it will slow down the performance. You are wondering how? Actually SSIS uses buffer oriented architecture (refer Best Practice #6 and #7 for more details on this), it means it pulls the data from the source into the buffers, does the transformations in the buffers and passes it to the destinations. So as many rows as SSIS can accommodate in a single buffer, performance will be better. By having all the columns as string data type you are forcing SSIS to acquire more space in the buffer for numeric data types also (by treating them as string) and hence performance degradation.
Tip : Try to fit as many rows as you can into the buffer which will eventually reduce the number of buffers passing through the SSIS dataflow pipeline engine and improve overall performance.
Best Practice #19 - Finally some more general SSIS tips
Merge or Merge Join component requires incoming data to be sorted. If possible pull a sorted result-set by using ORDER BY clause at the source instead of using the Sort Transformation. Though there are times, you will be required to use Sort transformation for example pulling unsorted data from flat files.
As I said above there are few components which require data to be sorted as input to them. If your incoming data is already sorted then you can use the IsSorted property of output of the source adapter and specify the sort key columns on which the data is sorted as a hint to these components.
Try to maintain a small number of larger buffers and try to get as many row as you can into a buffer by removing unnecessary columns (discussed in Best Practice #2) or by tuning DefaultBufferMaxSize and DefaultBufferMaxRows properties of data flow task (discussed in Best Practice #7) or by using the appropriate data type of the column (discussed in Best Practice #18).
If you are on SQL server 2008, you can utilize some of its features for better performance. For example you can use the MERGE statement for joining INSERT and UPDATE data in a single statement while incrementally uploading data (no need for lookup transformation) and Change Data Capture for incremental data pulls.
RunInOptimizedMode (default FALSE) property of data flow task can be set to TRUE to disable columns for letting them flow down the line if they are not being used by downstream components of the data flow task. Hence it improves the performance of the data flow task. The SSIS project also has the RunInOptimizedMode property, which is applicable at design time only, which if you set to TRUE ensures all the data flow tasks are run in optimized mode irrespective of individual settings at the data flow task level.
Make use of sequence containers to group logical related tasks into a single group for better visibility and understanding.
By default a task, like Execute SQL task or Data Flow task, opens a connection when starting and closes it once its execution completes. If you want to reuse the same connection in multiple tasks, you can set RetainSameConnection property of connection manager to TRUE, in that case once the connection is opened it will stay open so that other tasks can reuse and also in that single connection you can use transactions spanning multiple tasks even without requiring the Distributed Transaction Coordinator windows service. Though you can reuse one connection with different tasks but you should also ensure you are not keeping your connection/transaction open for longer.
You should understand how protection level setting works for a package, how it saves data (in encrypted form by using User key or password) or it does not save data at all and what impact it has if you move your package from one system to another, refer here for more details on this.
The above recommendations have been done on the basis of experience gained working with DTS and SSIS for the last couple of years. But as noted before there are other factors which impact the performance, one of the them is infrastructure and network. So you must do thorough testing before putting these changes into your production environment.
I am closing this series on SQL Server Integration Services (SSIS) - Best Practices with this Part 4, if users find any other best practices (I am sure there might be several others) which I missed here, I request you to kindly provide your comments on that so that other can get benefited with our experiences.
If the data coming from the source is not string data, then SSIS will essentially add columns for string versions of the numeric data. Since the buffer has two versions of each column thus affected, each row takes up more space than it could.
While the string versions of the numeric columns may take more space that the number versions, they'll take less space than the string version *plus* the number version.
Saturday, December 05, 2009 - 10:34:30 AM - arshad0384
First of all, thanks a lot Flin for you encouragement and I am glad you liked the series.
Second, combining all these tips in one single article will make it significantly large and will make it tedious for reading. But of your reference, I am providing below a brief of all these tips along with its links which you can refer whenever required.
SQL Server Integration Services ( SSIS ) - Best Practices Part 1 briefly talks about SSIS and its capability in terms of enterprise ETL. Then it gives you an idea about what consideration you need to take while transferring high volume of data. Effects of different OLEDB Destination Settings, Rows Per Batch and Maximum Insert Commit Size Settings etc. For more details click here.
Part 2 covers best practices around using SQL Server Destination Adapter, kinds of transformations and impact of asynchronous transformation, DefaultBufferMaxSize and DefaultBufferMaxRows, BufferTempStoragePath and BLOBTempStoragePath as well as the DelayValidation properties. For more details click here.
Part 3 covers best practices around how you can achieve high performance with achieving a higher degree of parallelism, how you can identify the cause of poorly performing packages, how distributed transaction work within SSIS and finally what you can do to restart a package execution from the last point of failure. For more details click here.
Part 4 talks about best practices aspect of SSIS package designing, how you can use lookup transformation and what consideration you need to take while using it, impact of implicit type cast in SSIS, changes in SSIS 2008 internal system tables and stored procedures and finally some general guidelines. For more details click here.
Saturday, December 05, 2009 - 11:56:08 AM - arshad0384
I am sorry but I suspect your understanding here is not correct, let me help you to understand it correctly.
Best Practice #18 is specific to Flat File Connection Manager only, if you are using any other type of data source for example SQL Server, SSIS knows in advance the metadata of the source table (SQL Server table) and hence it uses this information to create buffer.
The number of buffer created is dependent on how many rows fit into a buffer and how many rows fit into a buffer is dependent on few other factors. The first consideration is the estimated row size, which is the sum of the maximum sizes of all the columns (data type) from the incoming records. The second consideration is the DefaultBufferMaxSize property of the data flow task. This property specifies the default maximum size of a buffer (refer Best Practice #6 and #7 for more details on this).
For example let say, you have an Employee table with just two columns as given below in SQL Server table: EmpID - INT EmpName - VARCHAR(100)
and you have data like EmpID EmpName 1234567890 A1A2B1B2....Y1Y2
Now the estimated size of each row would be somewhere around 104 = 4 (for intger) + 100 (for varchar) if the source is SQL Server.
Now lets assume you have the same information in a flat file, because every column is treated as string if the source is Flat File Connection manager then in that case estimated row size would be somewhere around 110 = 10 (max length of EmpID column value) + 100 (max length of EmpName column value). Hence it is recommended to type cast your column of flat file appropriately if you know the kind of data it stores to make room for more rows to accomodate in a single buffer.
One of my jobs which uses SSIS package fails frequently with the error in DataFlowTask with the error " failed the post-execute phase and returned error code 0x80004002". There is nothing more in the logs as well. But when the job is rerun, it succeeds. We have configured the job to retry a couple of times on failure. This step is writing data from sql server to Excel. The server is a 64 Bit server.
The error log is not having much information othere than the error message stated. However, this error is occuring only when the package is run from the SQL agent job.However, this error is occuring only when the package is run from the SQL agent job.
Problem is resolved by setting the RetainSameconnection property to true. It seems the failure is due to the fact that the Excel file is being written into by 2 tasks parallely though in different sheets of the same file.
I have a question. If the data type of a column in source and destination tables are changed from float to varchar then will the SSIS package has to be modified to accept these changes? or SSIS interface will automatically accept the changes?