Excel File Data Validation for SQL Server Imports


By: Jeffrey Yao   |   Last Updated: 2015-09-03   |   Comments (4)   |   Related Tips: More > Import and Export

Problem

As a SQL Server Business Intelligence Professional, I work with numerous large Excel files every day. I need to either submit files to others or upload them to my SQL Server data warehousing system. To ensure the quality of my SQL Server data, I want to validate the data in these files against business requirements before I even start to process them. What can I do?

Solution

To validate data, we can apply regular expressions to the contents of Excel files. One common way to process Excel files is to use the COM Interop interface. In PowerShell, that looks like the following:

$ExcelApp = new-object -com Excel.Application;
#then you can handle Excel files using this $ExcelApp
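A slightly fuller COM sketch, with the cleanup that short examples often omit (the workbook path is illustrative; this requires Excel to be installed):

```powershell
# Open a workbook via COM, then always quit Excel and release the COM
# object, or orphaned EXCEL.EXE processes will accumulate on the machine.
$ExcelApp = New-Object -ComObject Excel.Application
try {
    $wb = $ExcelApp.Workbooks.Open('C:\temp\Test.xlsx')  # hypothetical file
    # ... read or validate cells here ...
    $wb.Close($false)  # close without saving
}
finally {
    $ExcelApp.Quit()
    [void][System.Runtime.InteropServices.Marshal]::ReleaseComObject($ExcelApp)
}
```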

However, this approach requires that the Excel application be installed, and in many server environments Microsoft Office is not present. The COM approach is also usually slow when handling big Excel files, and Microsoft explicitly advises against it:

Microsoft does not currently recommend, and does not support, Automation of Microsoft Office applications from any unattended, non-interactive client application or component (including ASP, ASP.NET, DCOM, and NT Services), because Office may exhibit unstable behavior and/or deadlock when Office is run in this environment.

So in this tip, I will use a community-tested and approved approach that does not rely on an Excel installation: the open-source EPPlus library (EPPlus.dll). After downloading the zip file and extracting EPPlus.dll to a folder, say C:\Tools\EPPlus.dll, right-click the file, select Properties from the pop-up menu, and then click Unblock, as shown below:

unblock epplus.dll
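The same unblocking can be done from a PowerShell prompt with the built-in Unblock-File cmdlet (the path assumes the DLL was extracted to C:\Tools):

```powershell
# Equivalent to clicking Unblock in the file's Properties dialog:
# removes the "downloaded from the internet" mark on the DLL
Unblock-File -Path C:\Tools\EPPlus.dll
```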

This solution has two key points:

  • Use EPPlus.dll to read Excel files
  • Use Regular Expression to perform validations on the cells of interest

PowerShell Source Code for Excel File Data Validation

Below is the PowerShell source code to be used for the Excel file data validation.

add-type -path c:\tools\EPPlus\EPPlus.dll

<#
.SYNOPSIS
Validate data in Excel files (for Excel 2007 and later versions only)
.DESCRIPTION
Validate data in Excel files via Regular Expression or just common strings. The validation can be on
multiple columns, each with different validation criteria. 
.INPUTS
ExcelFile: The full UNC path to the Excel file;
Condition: Hashtable format, put ColumnName and its corresponding verification criteria in such format as
@{ColumnName='criteria'; ColumnName2='criteria2'; ...}
Sheet: indicates the Excel file sheet number, so that the verification will be carried on this sheet. The default value is 1
 
.OUTPUTS
System.Data.DataTable
    The data table will contain all abnormal data (i.e. data failing the verification criteria). It has 3 columns, Row | Col | Text, which
    helps quickly locate the abnormal data in the Excel sheet.
.EXAMPLE
Validate-ExcelData -ExcelFile C:\temp\Person.xlsx -Condition @{Phone='^\d{3}-\d{3}-\d{4}$'; Province='^\w{2}$'} -Sheet 1;
This command validates two columns in the first worksheet of C:\temp\Person.xlsx, verifying that column [Phone] has the format ddd-ddd-dddd (d is a digit, 0 to 9).
It also verifies that column [Province] has a two-character value, like AB, ON or BC (Canadian province abbreviations)

#>

function Validate-ExcelData 
{

    param(
    [Parameter(Position=0, Mandatory=$true)] [ValidateScript ({test-path $_ -pathtype 'leaf'})]  [string]$ExcelFile,
    [Parameter(Position=1, Mandatory=$true)] [ValidateNotNullOrEmpty()] [hashtable]$Condition=@{}, # The format is "ColumnName='regular-expression'"
    [Parameter(Position=2, Mandatory=$false)] [ValidateScript ({ ($_ -gt 0) -and ($_ -le 100)})] [int]$Sheet=1
    )

    try 
    {    
        $result = New-Object "System.Data.DataTable"
        $c = New-Object "System.Data.DataColumn" ('Row', [System.Int32]);
        $result.Columns.Add($c);
        $c = New-Object "System.Data.DataColumn" ('Col', [System.String]);
        $result.Columns.Add($c);
        $c = New-Object "System.Data.DataColumn" ('Text', [System.String]);
        $result.Columns.Add($c);


    
        $xl = New-Object OfficeOpenXml.ExcelPackage $ExcelFile;

        $wb  = $xl.Workbook
        if ($wb.Worksheets.count -lt  $sheet) #$sheet cannot be beyond the existing WorkSheets
        {
            Write-Error "There is no [$sheet] sheet in this Excel file." -ea stop;
        }

        $ws=$wb.Worksheets[$Sheet]
        $dm=$ws.Dimension

        $Rows=$dm.Rows;
        $Columns=$dm.Columns;
    
        [string[]]$array_col= 1..$Columns| %{$ws.Cells[1, $_].Text.toupper()};

        [hashtable]$valid_cond=@{};
        $condition.keys | % {$valid_cond.add($_, $([System.Array]::indexof($array_col, $_.toupper())))}
        $valid_cond.GetEnumerator() | % {
            if ($_.Value -eq -1)
            {
                write-error "Column [$($_.Key)] cannot be found in Excel file [$ExcelFile]" -ea Stop; 
            }
        }

        foreach ($itm in $valid_cond.GetEnumerator())
        {
            foreach ($row in 2..$rows)
            {
                if ($ws.Cells[$row, ($itm.value+1)].text -notmatch $($Condition.get_item($itm.Key)))
                {
                    $r = $result.NewRow();

                    $r.Row = $row;
                    $r.Col = $itm.key;
                    $r.Text = $ws.Cells[$row, ($itm.value+1)].Text;
                    $result.rows.add($r);
                }
            }
        }#itm
        Write-Output $result;
    }#try
    catch
    {
        Write-Error $_;
    }

    finally 
    { 
        $result.Dispose();
        $result = $null;
        $xl.Dispose();
        $xl=$null;
    }
}
#uncomment and modify the following parameters to do your test in a PS ISE window 
#Validate-ExcelData -ExcelFile C:\temp\Person.xlsx -Condition @{Phone='^\d{3}-\d{3}-\d{4}$'; Province='^\w{2}$'} -Sheet 1;

Example

I ran the following test:

Step 1

Generate test data. I used AdventureWorks2012 to dump data out as a CSV file.

# generate test data
import-module sqlps -DisableNameChecking
set-location c:
[string]$qry = @"
select p.Name ,SalesOrderID, SalesOrderDetailID, CarrierTrackingNumber, OrderQty, s.ProductID, s.UnitPrice, s.LineTotal, s.ModifiedDate
from sales.salesorderdetail s
inner join Production.Product p
on p.ProductID =s.ProductID;
"@
# tp_w520 is my laptop name, change it to your own
invoke-sqlcmd -server tp_w520 -database AdventureWorks2012 -query $qry | Export-Csv -Path c:\temp\Test.csv -Force -NoTypeInformation;

Step 2

The test data has 120,000+ rows. I used Excel 2013 to open the c:\temp\Test.csv file and modified the last three lines as shown below to make the data abnormal and then I saved the modified file as c:\temp\Test.xlsx

ExcelFile Modification Image


  • 1st change: [SalesOrderID] should be a number (RegEx check = '^\d+$'), but I added a letter "A" in cell (121318, B) to make it invalid
  • 2nd change: [ProductID] should be a number (RegEx check = '^\d+$'), but I added a letter "A" in cell (121317, F) to make it invalid
  • 3rd change: [ModifiedDate] should be a date formatted as MM/DD/YYYY (RegEx check = '^\d{1,2}/\d{1,2}/\d{4}$'), but I changed cell (121316, I) to 7/131/2008 to make it invalid
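The three patterns above can be sanity-checked interactively with PowerShell's -match operator before running the full validation:

```powershell
# Quick interactive checks of the patterns used in the three changes
'12345'      -match '^\d+$'                     # True  - valid SalesOrderID/ProductID
'12345A'     -match '^\d+$'                     # False - the letter makes it invalid
'7/13/2008'  -match '^\d{1,2}/\d{1,2}/\d{4}$'   # True  - valid M/D/YYYY shape
'7/131/2008' -match '^\d{1,2}/\d{1,2}/\d{4}$'   # False - 131 cannot be a day part
```

Note that the date pattern checks the shape only, not calendar validity; a value like 13/45/2008 would still pass, so tighten the expression if that matters for your data.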

Step 3

Open the PowerShell command line window (the ISE works as well) and run the following two lines of code:

. c:\temp\validate-ExcelData.ps1 #dot source the function script, which is saved in c:\temp\ folder
Validate-ExcelData -ExcelFile C:\temp\test.xlsx -Condition @{SalesOrderID='^\d+$'; ProductID='^\d+$'; ModifiedDate='^\d{1,2}/\d{1,2}/\d{4}$'}
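Because Validate-ExcelData returns a System.Data.DataTable, its rows can also be piped on for follow-up work, for example saved to a CSV (the output file name here is illustrative):

```powershell
# Dot-source the function, run the validation, and persist the bad rows
. C:\temp\validate-ExcelData.ps1
Validate-ExcelData -ExcelFile C:\temp\test.xlsx `
    -Condition @{SalesOrderID='^\d+$'; ProductID='^\d+$'; ModifiedDate='^\d{1,2}/\d{1,2}/\d{4}$'} |
    Select-Object Row, Col, Text |
    Export-Csv -Path C:\temp\BadData.csv -NoTypeInformation
```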

The result is shown below:

PS result

Summary

This tip suggests a solution to validate Excel data via PowerShell and regular expressions. The performance is pretty good: for the example above, it took less than 15 seconds to run on my 5-year-old laptop (Win 7, 8GB RAM). In the ETL world, another common file type is CSV; the same kind of validation can generally be done via Import-Csv and then applying regular expressions to the corresponding columns.
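The CSV variation mentioned above can be sketched as follows (the sample data, column names and patterns are illustrative, not from the tip's test files):

```powershell
# Sketch: validate CSV columns with Import-Csv plus regular expressions,
# mirroring the Row | Col | Text output shape of Validate-ExcelData.
$csvPath = Join-Path ([System.IO.Path]::GetTempPath()) 'ValidateDemo.csv'
@'
SalesOrderID,ModifiedDate
43659,6/1/2008
A43660,7/131/2008
'@ | Set-Content -Path $csvPath

# Same hashtable shape as Validate-ExcelData's -Condition parameter
$conditions = @{ SalesOrderID = '^\d+$'; ModifiedDate = '^\d{1,2}/\d{1,2}/\d{4}$' }

$rowNum = 1  # the header is row 1; data rows start at 2
$badRows = Import-Csv -Path $csvPath | ForEach-Object {
    $row = $_
    $rowNum++
    foreach ($col in $conditions.Keys) {
        if ($row.$col -notmatch $conditions[$col]) {
            [pscustomobject]@{ Row = $rowNum; Col = $col; Text = $row.$col }
        }
    }
}
$badRows | Format-Table -AutoSize   # row 3 fails both checks here
```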

About the author
Jeffrey Yao is a senior SQL Server consultant, striving to automate DBA work as much as possible to have more time for family, life and more automation.





Friday, August 25, 2017 - 1:30:49 AM - jeff_yao

@Ed, 

The assumption is all those Excel columns have pre-defined format for each value according to business requirement, for example, we assume [SalesOrderID] column is digits only, and as such, if there is a character inside a [SalesOrderID] cell, it means this is an invalid value. 

With the knowledge of such assumption (i.e. what data format should be), you can come up with your own regular expression validation formula.

HTH,

jeff


Monday, August 21, 2017 - 6:33:06 PM - Ed

 In Step 3, what is the purpose of including "Condition @{SalesOrderID='^\d+$'; ProductID='^\d+$'; ModifiedDate='^\d{1,2}/\d{1,2}/\d{4}$'}" ? 

If this were an actual validation, how would I know what conditions to include in this final step until the data has been validated?

 Thank you.


Tuesday, September 15, 2015 - 1:17:50 PM - jeff_yao

@Dennis, thanks for your comment. There are cases where importing to staging tables may not be the ideal solution.

On top of that, validating values without RegEx (which native T-SQL lacks) can be tough or impossible in many cases.


Monday, September 14, 2015 - 10:27:27 AM - Dennis

Why learn new technologies for a simple validation process? Just import into a staging table - all nvarchar - and then do the validation processing in SQL code. (The SSMS Import Wizard will even create the staging table for you if it does not exist.)

