In the previous tips (SQL Server Integration Services (SSIS) - Best Practices - Part 1, Part 2 and Part 3) of this series I briefly talked about SSIS and few of the best practices to consider while designing SSIS packages. Continuing on this path I am going to discuss some more best practices of SSIS package design, how you can use lookup transformations and what considerations you need to take, the impact of implicit type cast in SSIS, changes in SSIS 2008 internal system tables and stored procedures and finally some general guidelines.
In this tip my recommendations are around
- lookup transformation and different considerations which you need to take while using it,
- changes in SSIS 2008 system tables and stored procedures,
- impact of implicit typecast and
- finally some general guidelines at the end.
Best Practice #15 - Did you know, SSIS is case-sensitive?
In one of my projects, once we added one new column in a source table and wanted it to be transferred to a destination table as well. We did the required changes in our SSIS package to pull the data for this additional column. But when we started testing we noticed our SSIS package was failing with the following error.
|[OLE DB Destination ] Warning: The external columns for component "OLE DB Destination" (16) are out of synchronization with the data source columns. The column "EmployeeId" needs to be added to the external columns.|
The external column "EmployeeID" (82) needs to be removed from the external columns.
[SSIS.Pipeline] Error: "component "OLE DB Destination" (16)" failed validation and returned validation status "VS_NEEDSNEWMETADATA".
After spending several frustrating hours investigating the problem, we noticed that even though our SQL Server is case insensitive, the SSIS package is case sensitive. The reason for the above failure was that we altered the table for one column from EmployeeId to EmployeeID and since the SSIS package has stored source and destination column mappings with the old definition, it started failing because of this column name case change. So whenever you get this kind of error, match your source/destination columns case with the mapping stored in the SSIS package by going to the mapping page of OLEDB destination adaptor of the Data Flow Task.
Best Practice #16 - Lookup transformation consideration
In the data warehousing world, it's a frequent requirement to have records from a source by matching them with a lookup table. To perform this kind of transformation, SSIS has provides a built-in Lookup transformation.
Lookup transformation has been designed to perform optimally; for example by default it uses Full Caching mode, in which all reference dataset records are brought into memory in the beginning (pre-execute phase of the package) and kept for reference. This way it ensures the lookup operation performs faster and at the same time it reduces the load on the reference data table as it does not have to fetch each individual record one by one when required.
Though it sounds great there are some gotchas. First you need to have enough physical memory for storage of the complete reference dataset, if it runs out of memory it does not swap the data to the file system and therefore it fails the data flow task. This mode is recommended if you have enough memory to hold reference dataset and. your referenced data does not change frequently, in other words, changes at reference table will not be reflected once data is fetched into memory.
If you do not have enough memory or the data does change frequently you can either use Partial caching mode or No Caching mode.
In Partial Caching mode, whenever a record is required it is pulled from the reference table and kept in memory, with it you can also specify the maximum amount of memory to be used for caching and if it crosses that limit it removes the least used records from memory to make room for new records. This mode is recommended when you have memory constraints and your reference data does not change frequently.
No Caching mode performs slower as every time it needs a record it pulls from the reference table and no caching is done except the last row. It is recommended if you have a large reference dataset and you don't have enough memory to hold it and also if your reference data is changing frequently and you want the latest data.
To summarize the recommendations for lookup transformation:
- Choose the caching mode wisely after analyzing your environment and after doing thorough testing
- If you are using Partial Caching or No Caching mode, ensure you have an index on the reference table for better performance.
- Instead of directly specifying a reference table in he lookup configuration, you should use a SELECT statement with only the required columns.
- You should use a WHERE clause to filter out all the rows which are not required for the lookup.
- In SSIS 2008, you can save your cache to be shared by different lookup transformations, data flow tasks and packages, utilize this feature wherever applicable.
Best Practice #17 - Names of system tables and procedures have changed between SSIS 2005 and SSIS 2008
SSIS gives you different location choices for storing your SSIS packages, for example you can store at file system, SQL server etc. When you store a package on SQL Server, it is stored in the system tables in msdb database. You can write your own code to upload/download packages from these system tables or use un-document system stored procedures for these tasks.
Now the twist in the story is, since SSIS 2005 has grown up from DTS, the system tables and system stored procedures use a naming convention like "dts" in its name as you can see in first column of table below. With SSIS 2008, the SSIS team has standardize the naming convention and uses "ssis" in its name as you can see in the second column of the table below. So if you are using these system tables or system stored procedure in your code and upgrading to SSIS 2008, your code will break unless you change your code to accommodate this new naming convention.
Best Practice #18 - Be aware of implicit typecast
When you use Flat File Connection Manager, it treats all the columns as string [DT_STR] data type. You should convert all the numeric data to appropriate data type or else it will slow down the performance. You are wondering how? Actually SSIS uses buffer oriented architecture (refer Best Practice #6 and #7 for more details on this), it means it pulls the data from the source into the buffers, does the transformations in the buffers and passes it to the destinations. So as many rows as SSIS can accommodate in a single buffer, performance will be better. By having all the columns as string data type you are forcing SSIS to acquire more space in the buffer for numeric data types also (by treating them as string) and hence performance degradation.
Tip : Try to fit as many rows as you can into the buffer which will eventually reduce the number of buffers passing through the SSIS dataflow pipeline engine and improve overall performance.
Best Practice #19 - Finally some more general SSIS tips
Merge or Merge Join component requires incoming data to be sorted. If possible pull a sorted result-set by using ORDER BY clause at the source instead of using the Sort Transformation. Though there are times, you will be required to use Sort transformation for example pulling unsorted data from flat files.
As I said above there are few components which require data to be sorted as input to them. If your incoming data is already sorted then you can use the IsSorted property of output of the source adapter and specify the sort key columns on which the data is sorted as a hint to these components.
Try to maintain a small number of larger buffers and try to get as many row as you can into a buffer by removing unnecessary columns (discussed in Best Practice #2) or by tuning DefaultBufferMaxSize and DefaultBufferMaxRows properties of data flow task (discussed in Best Practice #7) or by using the appropriate data type of the column (discussed in Best Practice #18).
If you are on SQL server 2008, you can utilize some of its features for better performance. For example you can use the MERGE statement for joining INSERT and UPDATE data in a single statement while incrementally uploading data (no need for lookup transformation) and Change Data Capture for incremental data pulls.
RunInOptimizedMode (default FALSE) property of data flow task can be set to TRUE to disable columns for letting them flow down the line if they are not being used by downstream components of the data flow task. Hence it improves the performance of the data flow task. The SSIS project also has the RunInOptimizedMode property, which is applicable at design time only, which if you set to TRUE ensures all the data flow tasks are run in optimized mode irrespective of individual settings at the data flow task level.
Make use of sequence containers to group logical related tasks into a single group for better visibility and understanding.
By default a task, like Execute SQL task or Data Flow task, opens a connection when starting and closes it once its execution completes. If you want to reuse the same connection in multiple tasks, you can set RetainSameConnection property of connection manager to TRUE, in that case once the connection is opened it will stay open so that other tasks can reuse and also in that single connection you can use transactions spanning multiple tasks even without requiring the Distributed Transaction Coordinator windows service. Though you can reuse one connection with different tasks but you should also ensure you are not keeping your connection/transaction open for longer.
You should understand how protection level setting works for a package, how it saves data (in encrypted form by using User key or password) or it does not save data at all and what impact it has if you move your package from one system to another, refer here for more details on this.
The above recommendations have been done on the basis of experience gained working with DTS and SSIS for the last couple of years. But as noted before there are other factors which impact the performance, one of the them is infrastructure and network. So you must do thorough testing before putting these changes into your production environment.
I am closing this series on SQL Server Integration Services (SSIS) - Best Practices with this Part 4, if users find any other best practices (I am sure there might be several others) which I missed here, I request you to kindly provide your comments on that so that other can get benefited with our experiences.
Last Update: 12/4/2009
About the author
View all my tips