Several executives tell us that they will be collecting data from thousands CSV or text files and we should expect to see at least ten new files a day. Instead of trying to import each file one by one, how can we create an SSIS package that pulls the data from each file into SQL Server while being able to handle a large number of files and data? Check out this tip to learn more.
Before we get address the solution for this tip, I want to be clear that this tip solely addresses CSV and text files, not Excel. Excel requires a similar process to this, which is not covered, and also assumes that the Excel file formats are identical (Excel allows users to change formats, often creating a headache with SSIS). Click here and here to learn more about working with Excel.
Now to the solution, we can load CSV and text files easily in SQL Server Integration Services (SSIS) without requiring much time, energy and also in a way that's constructed to handle more data growth. Let's jump right into the Business Intelligence Development Studio or Visual Studio to get started. Here are the SSIS Packages I have created for this tip, feel free to use them as a point of reference as you read this tip (note that one adds a derived column to also import the file name, the other is generic and allows you to adjust as you need).
1. Start with a Foreach Loop Container as shown in the first image below. We will place a few tasks inside of this container momentarily. Also, we'll create two string variables, one called ArchiveFolder and the other CurrentFile as shown in the second image below. The CurrentFile variable should be empty, while the ArchiveFolder variable should have the location where we want our files to be moved after the data are imported. To access the variables window navigate to View | Other Windows | Variables.
2. Next, we double click the For each Loop Container. On the Collection tab (the first image below), we'll enter the path where we are going to import the files from (in the Folder text box) - for instance, C:\OurFolder. In the Files text box, we'll enter the extension of the files (and select the "Fully qualified" radio button) - for instance, *.csv* would mean that we're importing CSV files. Next click we'll, click Variable Mappings tab (in the second image below), and insert the variable CurrentFile with an Index of 0 (should be default on the index).
3. Next, we'll insert three other tasks inside the For each Loop Container: a Data Flow Task, an Execute SQL Task, and a File System Task (as shown in the first image). All of these tasks should have the green arrows pointed to the next task as this will be our process as the looping begins. Note that the Execute SQL Task can go outside of the loop as well (meaning, that we'll manipulate the data after the import process as shown in the second image). While we can perform data manipulation inside of SSIS, depending on our experience, we can also manipulate it through code. The advantage of code is that it can handle changes better - for instance, executives telling us that they want the date formatted differently.
4. Inside of the Data Flow Task, add a Flat File Source and an OLE DB Destination. For the Flat File Source, we are going to initially select one of our flat files and use that as a guideline to create flat file connection. Once the Flat File Connection is listed under the Connection Managers (at the bottom, in this example called "OurFlatFileConnection"), we will change the Connection String Expression to "CurrentFile" so that it points to the current file during the loop. For "OurMSSQLConnection", we'll enter our server and database name and inside of the OLE DB Destination, point the flow to the right table.
5. Depending on where we set our Execute SQL Task, we'll adjust its properties as well - in this case calling our stored procedure to clean up the data. For instance, if we have chemical data that only need to show the fourth decimal place, and our text or CSV files have seven decimals places, we can clean the data with a stored procedure to round to the nearest fourth decimal place (for instance: CAST(ROUND(chemicals,4) AS DECIMAL(7,4)). We could also have the stored procedure transfer data from a staging (or temporary) table into a main table to ensure that our data is good, as well as keep a receipt of bad data.
With SSIS and data imports, keeping receipts of bad data will also help us communicate with the data sources where bad data are originating and this may help them identify problems in their systems.
6. The final step, we'll move the file to our archive folder. We do this by selecting the Destination Connection as the variable ArchiveFolder and setting the Source Connection as OurFlatFileConnection. Whatever the current file is during the loop process, that will be the file that is moved into the archive folder.
Sure would be nice if y'all had a link to print the tip & it's related tips in one fell swoop. I can only cut so much time out of my work day for learning. A printed version, paper or PDF, could ride hoem with me for later reading.
PS thank you for not using Captiva, it usually taked a half dozen attempts to get the code right :-)
Wednesday, April 03, 2013 - 3:31:32 PM - Greg Robidoux
@Tim Sullivan - there is a print option for all tips, but there is not a way to print multiple at one time. If you look above the "Comments and Feedback" section there is an option that will all you to print a clean version of the tip.
Wednesday, April 03, 2013 - 8:38:25 PM - Tim Sullivan
I use a ForEachFolder Loop to loop through close to 15 folders. Each folder has up to 25 csv files. The files have similar names in each of the 15 folders. What I noticed was that during each iteration the foreach folder loop does not read all the data from a file. For example, I have a file named FACT_Address.csv with 2287 records. When i run the for each folder container only 889 are loaded and there are no errors whatsoever. But if I run the task individually, all the 2287 rows are loaded.
Has anyone ever faced a similar problem with the ForEachFolder Loop?
Figured it out. It's in the properties of the flat file connection. Select the ... next to the expressions property and you should be able to add the connectionstring property with the current file expression.
I created the same package as above in ssis 2012 and it works in ssdt perfect. I deployed and when I execute it succeeds but It's not moving the files or anything. When I read the report that comes out I get a warning from the foreachloop saying it's empty. But it's not because when I run it from SSDT it works and transfers the files. Did I forget to deploy something I only have variables. Am I suppose to create paramters? Please help
@dina In testing on SSDT, make sure that it matches how it will run in production. If files aren't moving and data aren't importing, that usually means that something doesn't match between the test environment and production. Also, for the foreach loop, you want to make sure that a parameter exists (show above) which will hold the file currently being imported, so if you have no parameters in your package, see steps 1, 2, and 3.
@Tapan It depends on your environment, how often you receive the files, and how often you'll need the data. For instance, if you need the data weekly, then you want to do it at least weekly. If you go to my GitHub account (link: https://github.com/tmmtsmith/SSISLoopWithFileName), it has a demo SSIS package that also grabs the file name, so yes, it's possible.
My SSIS package is already working with a hardcoded filename from the folder as a flat file source. The issue I am coming across is, there will be multiple files with the date as the filename. My SSIS package should only import the most recent file and transfer it over to the db. How can I achieve that?
My SO question is here: http://stackoverflow.com/questions/22969449/how-to-use-the-latest-file-in-a-folder-for-source
If you're calling your package in a job step, you create a job step before calling the package and set the step to use PowerShell and call this function, https://github.com/tmmtsmith/Powershell/blob/master/RemoveOlderItems.ps1, by passing in the file location of these files. This function will remove the oldest files in the directory, and keep the most recent, so that you only import the most recent file.
Thanks very much for putting this together. I have been putting off converting some old DTS packages that have been playing up since moving to SQL 2008 as I wasn't looking forward to converting the Active X scripts that do this task. I was able to get it running within an hour with the help of your article so no longer need Active X!
It was also my first play with SSIS 2012 and it is much better than SSIS 2008!