We hosted a webcast with the subject Integration Services Best Practices and Tuning Tips on September 15, 2015. There were several questions for which we didn't have time left to answer, so this tip takes the opportunity to answer those questions. Below are the questions, along with their answers. If you still have questions left, you can submit them in the comments section. You can watch the webcast recording here.
Q: Do you have an example of logging to a table with SQL Server Integration Services?
Since this question needs an elaborate answer, there is a separate tip Integrated Logging with the Integration Services Package Log Providers which explains the integrated log providers of SSIS packages in greater detail.
Q: Could BIML be used to generate SSIS packages with logging?
An introduction to BIML - which is an XML format to generate SSIS packages based on metadata - is described in the following tips:
- Introduction to Business Intelligence Markup Language (BIML) for SSIS
- Using BIML to Generate an SSIS Import Package
- Generate multiple SSIS packages using BIML and metadata
On the blog of SQL Server MVP Catherine Wilhelmsen there's a great resource page for BIML.
Q: How can you use one version of BIDS/SSDT-BI when you have to support multiple Integration Services versions? (2005, 2008R2, 2012, 2014, etc.)
Well, for the moment, you can't. In the webcast I mentioned to try to stick to one single version of SSIS (each version of SSIS requires a different version of Visual Studio). This is an ideal scenario of course. I included it in the webcast because I had a client which was using Visual Studio 2008, Visual Studio 2010 and Visual Studio 2012 for packages in one single project. It was really messy, because they only upgraded packages to the latest version of Visual Studio when they had to make a change to a package, not the entire project at once. So my advice is to migrate entire projects at a time, not package by package. In SQL Server 2012, you can choose between Visual Studio 2010 (which comes with the SQL Server set-up) and Visual Studio 2012 (which is a separate download). In that case, the entire team should try to use only one version, not mixing the two just for the sake of consistency.
However, when you need to support multiple version of Integration Services, you need to have multiple version of Visual Studio because there is no backwards compatibility. I currently have 4 different versions of Visual Studio on my laptop. The good news is that they can work independently of each other. It has been announced - but not 100% confirmed - that Visual Studio will support backwards compatibility in SQL Server 2016, but this is still uncertain for the moment.
For more information on the subject: The SSDT Evolution.
- Do you have to use the same version of Visual Studio as SQL Server? Yes.
- Can any of the older versions work with the latest versions of SQL Server? No. Currently there is no forward or backward compatibility, at least for SSIS. SSAS and SSRS have better support for backwards compatibility.
- Is there a chart anywhere to tell us what Visual Studio versions works with specific versions of SQL Server? This blog post on MSDN gives a good overview of general Visual Studio compatibility.
Q: What are the most efficient ways to run a large aggregate SQL Server Integration Services Job? I split my data flows to do aggregation, then handle the results separately.
Not exactly a question, but rather a statement on my best practice that aggregations (and sorting) should be done in the SQL Server database, not in the SSIS data flow. The reason for this is that SSIS uses blocking components to do these types of transformations which means they need to read all of the rows into memory before they can output even one single row.
Separating the flow into separate streams has certain merit, since you're doing a "divide and conquer" strategy. High-level this corresponds with the same map-reduce method of big data systems. However, this approach has two options: you either execute the different data flows in parallel or you execute them one after another. The first option still has the issue that everything has to be loaded into memory at once. The second option has a performance impact since a data flow always has to wait for the previous data flow to finish. Both options still need sorted data. This might be resolved by reading from a table with an index supporting the sort order, since then the database engine can just read the sorted data. You still have to mark the input as sorted in the advanced editor of the source component in the data flow. The tip Merge multiple data sources with SQL Server Integration Services shows how you can do this.
My opinion is that in most cases, the database engine is best suited to handle large set based operations such as aggregating and sorting data. The data flow is capable of handling small data sets which fit in memory without issues. The SSIS package however becomes less scalable for larger data loads. SQL Server can scale reasonable well. If the aggregation is too much even for SQL Server to handle, alternative options such as the map-reduce method can be considered.
Q: Can you use the Slowly Changing Dimension wizard if you don't use the auto-generated OLEDB Command transformation? Can you write those rows to a staging table and process them in a separate Execute SQL Task?
This is indeed a valid method to increase performance of a data flow using the Slowly Changing Dimension (SCD) wizard. I use the same method to handle updates in a data flow: write them to a staging table and handle the updates with a set-based UPDATE statement in an Execute SQL Task. The reason for this is because the only option in the data flow to do updates is the OLE DB Command, which process data row by row. This is not only very slow for data sets over 100 rows, but it also means this update transformation is not minimally logged in the database.
One downside of this method is that if you run the wizard again, you lose all your changes since the wizard regenerates the data flow and you have to redo your actions.
Q: The webcast mentions SSIS restartability, but not checkpointing. Why?
There are several reasons for this:
- Checkpointing is not that straightforward to implement.
- Checkpointing works well for only a limited number of scenarios. Looping or parallelism are difficult to handle.
- Checkpoints are not that flexible.
- Event handlers are problematic.
- Checkpoints do not store variables of type Object.
SQL Server MVP Jamie Thomson explains his objections in the blog post Why I don't use SSIS checkpoint files (it's an old post, some links are broken). The discussion is also interesting to read.
Personally I prefer to create small modular packages. Each package only does a basic task that shouldn't take too long. For example, load a dimension. I won't create one single huge package that will load all dimensions of a data warehouse. The most time is usually spent in the data flow, which isn't covered by a checkpoint anyway. If I would have an Execute SQL Task at the end of the package taking an hour to run, I will most likely put it in a separate package so it can easily be restarted.
If you have a valid use case for using checkpoint files, you can check out the tip Integration Services Checkpoints to restart package from failure.
Q: Can Integration Services packages be branched in source control? I had heard because of binary code in script tasks that branching \ merging wasn't possible.
To be honest, I haven't had a project where I used branching on SSIS packages with script tasks. The SSIS team at Microsoft suggests using branching as a method to deal with the project deployment model and the inability to deploy individual packages, so I would assume (maybe wrongly) that version control should be able to handle this. Maybe script tasks are indeed an issue with branching and in that case you are better to revert the package to an earlier working version, deploy the project and then revert the package back to its latest state.
Q: Regarding your update\insert data method, is there a reason you didn't use a MERGE T-SQL statement?
Yes, but most reasons are personal or dependent on requirements:
- SQL Server 2005 doesn't support the MERGE statement, so the upsert method in SSIS is your only choice there. I started developing on SSIS 2005 and that's when I picked up the upsert pattern.
- The upsert pattern works when your source data is on another server or when the source isn't relational. MERGE has more difficulties with this.
- MERGE doesn't give you the row count of updated, inserted or deleted rows. It only gives you the row count of all the rows that were affected by MERGE. You could use the work around where you dump all the rows into a temporary table using the OUTPUT option of the statement and then count all the different rows, but this has an impact on performance. With the upsert pattern, you have a better view on the row counts.
The main reason for not using MERGE in my case is the last reason, as usually I need the row counts for audit and trace reasons. However, if you do not have this requirement, MERGE is a great alternative solution and I do not particularly advise against it. If you find that it works for your project, by all means use it. The tip Use Caution with SQL Server's MERGE Statement by Aaron Bertrand gives some warnings about its usage.
- Couldn't you use a MERGE Statement instead of a bunch of OR statements? Yes, I could as I explained above. But I didn't because I also needed a use case to show the magic ALT trick.
Q: If your choices were oledb vs. odbc, which would you prefer?
OLE DB. It doesn't require a special set-up like ODBC does; you only need to have the client providers installed on the machine. OLE DB in SSIS is supposedly deprecated by Microsoft, but I don't see any change coming any time soon. Most of the time, OLE DB is the fastest option. Another reason is that in earlier versions of SSIS, there was no ODBC source or destination and using ODBC required more effort to get it working, while OLE DB worked directly out of the box.
Q: In SQL Server 2012, can't you set up an SSIS catalog and use that for version control? It allows for roll backs as well.
Yes, that is possible. When you deploy a project to the SSIS catalog, it will keep track of the different deployments and you can roll back to a previous version. However, this "version control" is quite rudimentary and it is only on the project level. There is no versioning on the individual packages. Full version control on the actual source code - the individual packages - in Visual Studio is still necessary.
Q: Do you ever mess around with some of the more obscure settings such as MaxBufferSize? I've found it to be a bit of a black art, but it can yield significant performance benefits.
MaxBufferSize is an internal non-configurable property - as explained in the blog post Buffer Size in SSIS - so I guess the question is about DefaultBufferSize and DefaultBufferMaxRows. I change those two properties quite frequently. Most of the time I make them a lot bigger, so that buffers in the SSIS data flow can transport more rows at once, which can indeed yield big performance benefits. I give an example in the tip Improve SSIS data flow buffer performance.
Q: I've not been able to figure out where package templates are held for VS2013 with SSDT...
The location is similar to previous versions. On my machine this is the folder - C:\Program Files (x86)\Microsoft Visual Studio 12.0\Common7\IDE\PrivateAssemblies\ProjectItems\DataTransformationProject\DataTransformationItems. For more information about using templates: Creating SSIS Package Templates for Reusability.
Q: That ALT + select trick by itself was worth joining this for!!!
Indeed it is! For people who do not know what we're talking about, watch the recording of the webcast at around 39:30 or read the blog post The Magic Alt Button.
Q: What's the best way to identify if backpressure is occurring?
Backpressure in SSIS is the phenomenon where the source has to slow down because the data flow cannot keep up its pace. Most of the time this is caused by a slow destination. You can quickly diagnose this by removing the destination and connecting the data stream to a dummy transformation such as a Multicast. If your package goes lightning fast, the destination is the problem. It is always expected that the destination gives performance overhead: the package is writing to disk and disks aren't always as fast as we want them to be. A performance tuning effort might be to make sure that the writes to SQL Server are minimally logged. If they are not, the destination will be much slower. Check out this recording of David Peter Hansen of his session on SQLBits on minimally logged operations: Bulk load and minimal logged inserts.
- If you watched the webcast or the recording and you still have questions, you can always ask them in the comments. If you asked a question during the webinar and it is not featured in this tip, it's possible the question wasn't clear to me. Please use the comment section to clarify.
- Take a look at the MSSQLTips.com website to see which webcasts are planned: Free SQL Server Webcasts, Videos and Webinars.
- You can find more tips about SSIS best practices here.
Last Update: 2015-11-04
About the author
View all my tips