We recently faced an issue with an internal data provider where we imported a flat file at 1AM, but in the past few months, the data provider began providing an incomplete or inaccurate file because it lacked the proper length (or didn't exceed a certain size). We knew that if the file had over two million lines, it was legitimate, yet incomplete files would appear with only a few thousand lines, causing the import process to start which led the automated morning system to "think" it had imported the data for the day. Because of the overnight process, no developer was around to recognize the problem and by morning, production was ready to go with an incomplete data set.
We could offer a couple of solutions outside of coding a solution: (1) have someone check everything early before production starts or (2) let go of the data provider (for large organizations, this often won't work - startups have a huge advantage here). In our case, we implemented a check on the file before importing. This helped in our case because the file size limited the latest time that the file could be imported (4 AM was too late, it had to be started by 3 AM), but consider that this may not be an issue for you, so you could re-loop a check, or call an SSIS package (with a script check) again.
For this solution, we can perform file checks and ensure that we're importing the correct file. In the case of the data provider, if the file exceeded a certain line length or a file size, it was a legitimate file. Some situations may call for other checks and these checks should happen before an import and either (1) stop the import process and notify developers (as customers need to know why daily data are not present), or (2) call the SSIS package or a SQL Server Agent Job again later. For this example, we will use the file inflationdata.csv; note that any file can be checked by changing the file path.
In the case of file length, in a job step before a PowerShell import, we can measure the file line length; or as in the below code, we can wrap an import around an IF statement. The PowerShell code check is below:
$f = Get-Content "C:\files\inflationdata.csv"
$l = $f.Length
if ($l -gt 120000)
## Import function
A C# check, which can be altered if using an SSIS script task, is below:
long cnt = File.ReadLines(@"C:\files\inflationdata.csv").Count();
if (cnt > 120000)
Screenshots of testing (PS Write-Host or C# Console.WriteLine help in testing before wrapping imports):
As a note here, another approach would be to measure the size of the file. For instance, if the file exceeds a certain amount of MB/GB, then import. In our case, we knew that if the file was over a certain amount of lines, it was valid and could be imported. For a separate process, though, we also used the below approach because one of our files had to exceed 2.7GB before we knew it was valid and obtaining the file size was the faster approach. The below code shows this:
FileInfo fl = new FileInfo(@"C:\files\inflationdata.csv");
// MB (fl.Length is measured in bytes, so for MB use 1024 squared):
double mb = (fl.Length / 1048576.00);
//// GIG for large files (use 1024 cubed):
//double gig = (fl.Length / 1073741824.00);
if (mb > 3)
This solution eliminated the problem and with automation we were able to notify customers early if data from the data provider wouldn't appear for that day. In large scale operations where we may not have the choice to switch data providers, we definitely want to alert our customers if their is an error.
If you need to perform checks on file imports, wrap imports with necessary checks.
Test file line length and file size.
Last Update: 5/6/2014
About the author
Tim works as a DBA and developer and also teaches Automating ETL on Udemy.
Good article. I'm not a powershell user, but I definitely see its usefullness here. One thing that we have to check on some of our imports is the file date. We have one import that runs every 15 minutes during business hours, and then a full replacement that runs early in the morning every day. We don't want to do the import during the business hours if it is older data from the previous day, so we add an extra layer of checking to make sure that the file date is newer than 30 minutes old. We've used the extended stored procedures from SQL, but that requires adding extra permissions. Does powershell let you get around the extra permissions, and what would be the command for that? Thanks
One thing that I always try to do as a developer is to write code that makes what I'm doing obvious. Even though you added comments, personally I would write MB as 1024 * 1024 and GB as 1024 * 1024 * 1024 instead of doing the math and writing 1048576.00 and 1073741824.00, respectively. These numbers might not make any sense to someone maintaining your code, but 1024 should be very familiar.
@Wise Thanks! It depends on your environment; for instance, if I were solely a developer and I was facing environment restrictions, I would get in contact with the DBA. Ultimately, at large environments, restrictions may exist for excellent reasons and thus demonstrating why something is useful and saves company resources makes a strong case. PowerShell from SQL Server Agent, for instance, will use the permissions granted to it in the agent.
@Jeff Yes, it will load it into memory, which may limit environments which have restrictive requirements or do not set aside a server for ETL processes (not recommended if ETL is a big part of the company). In those cases, or even for large files where we want to minimize memory, I'd suggest the below approach with PowerShell, which had relatively low impact on memory in testing with a 1GIG file:
I realize that it costs a little money, but there are also CLR functions/procedures within the SQL# http://www.sqlsharp.com library which does all of this in a simple way.
-- determine if file exists
-- if you want to find out some properties of the file this would return a table with: (file name, file location, length/size, creation time, last access time, last write time, read only flag, hidden flag, archive flag, compressed flag, encrypted flag, temporary flag, type, level, and any error message)
SELECT * FROM SQL#.File_GetFileInfo('C:\somefile.csv');
-- if you want to found out the # of rows in the file, you have to examine it. I believe this streams the file, and doesn't load it into memory...but I'd have to ask the creator of the library
SELECT LineNum FROM SQL#.File_GetFile('C:\somefile.csv',0);
I've used all of these with great success in many production jobs within a large organization.
Wednesday, August 13, 2014 - 2:30:03 AM - Solomon Rutzky
Hi there. I would like to make a minor correction to something that @David mentioned in his comment regarding the SQL# function File_GetFile(). That function, at least in that particular usage, does not stream the contents of the file and does read it entirely into memory. But, that is mainly due to the second parameter -- which is @SplitLines -- being set to 0 which means "return the entire contents of the file as a single record". While that would get you the total number of lines, you could pass in 1 for @SplitLines which will stream the contents of the file, one row at a time (as of version 3.2). This would allow for something like the following:
DECLARE @TotalLines INT; SELECT @TotalLines = LinNum FROM SQL#.File_GetFile('C:\somefile.csv', 1);
On the other hand, it would be faster to call a function that was just added in the most recent version which does basically the same thing as the GetLastLineNumber function shown in the comment above by @Tim. The function is File_GetLineCount() and it just reads a line from the file and discards it. So, same as with the GetLastLineNumber function, the maximum memory consumed is the longest line of the file. And that is really only an issue if the file has a 500 MB (or larger) string of characters before the first newline or before the EOF.