By: Brian P ODwyer | Comments (6) | Related: > Microsoft Excel Integration
Problem
We are migrating from a SQL Server that has been around for almost 10 years and it is used by an entire sister company for data reporting both with third party tools and more specifically with Excel. There are Excel spreadsheets that contain data connections to this server from Excel version 2003 to Excel 2013. These spreadsheets may be run periodically; once a week, a month or even once a year and the original creator may have long since gone with the new owner of the spreadsheet/process unaware of where or how the data is retrieved. It is not a simple thing to ask for a list of all spreadsheets that have data connections in them and then fix those on the list. There is a great likelihood that most people will not understand what you are asking them, nor will they remember the once a year spreadsheet they use until they actually have to use it.
We are then left with several alternatives, scanning for and creating a list of all spreadsheets and opening them one by one to see if they have a connection or relying on a list compiled by other people that may not be as accurate as we want. Of these the first is more likely to get most of the spreadsheets that could have a database connection, but opening them all one by one is not appealing. Fortunately we do have some tools we can leverage to help us do this and to automate a process of looking inside the spreadsheet to see if it does have a database connection. We can also make use of our toolset that is free and flexible.
We are going to fall back on PowerShell to make our lives easier and make use of some features of Excel that can help make the job easier. The nice part of Excel 2007 and higher is that the xlsx file is actually a zip file with the contents being xml files which means it is text and we can search that. We just need to know which file to look at and how to pull the xml into memory and search it. For the earlier versions, Excel is regrettably in a binary format, but all is not lost with a change in the encoding page used we can convert it into a string thus avoiding the byte array approach and making the search a simple string search. We now have a way to search inside the Excel files for database connections and we combine this with file searching ability in PowerShell to create an automated method of finding and searching the files, spitting out those that have database connections. We have increased the likelihood of getting the largest list of Excel files meeting our criteria and we filter to give us those we need only.
Of course this is not perfect, there are some things we cannot get around such as path lengths over 248 characters, directories we do not have permission to see/search, corrupt Excel files (or those with different date/time code pages) and finally password protected files. We would run into these issues if we did a scan only or if we were given a list of files. Something's are unavoidable, but we are looking for the best effort and the PowerShell method should do that for us. In the script I turn off errors so that it runs cleaner, but if you want them exposed you can comment that line out and pickup these issues and address them as well. That means you have to use another method to scan for Excel files if you want pick up the errors individually, the commandlet may not return the errors individually and explicitly except for the last one in the try/catch block. Finally, if you need higher level permissions to access folder/volumes you will need to get them or run the script under a different logon ID with those rights.
Examples of the errors that can occur when running a Dir command on a file system. We have path length, permissions and memory errors represented.
Solution
My solution uses PowerShell to search using Dir (an alias for Get-ChildItem) for Excel files in a path and then take these and put selected parts into an object which has the full path, file extension and file name and the file name without the extension. This object will contain this listing for all the Excel files found and we will then loop through the object row by row and process each file by first copying it to a local drive and then operating on it. If it fails we see which file it failed on and more importantly we only expose one file to possible visibility of the contents which is important for HIPPA or sensitive data. As the files are looped through, two paths are available for the file to be processed depending on file extension, xls or xlsx, each using a different method to search in the file.
For the xls files we read them into memory using a special encoding which converts the file to a giant string in memory. Once it is a string we can search it with a string index method for "Provider=SQL" and then grab the 256 characters from the index location onward. Why 256 you ask, well the column is 256 characters in Excel for these versions and that is all we need to get the information we want. There may be more to the provider string, but we do not need it and parsing the cell dividers is not fun or easy to get data we do not need. We check to see if the index is greater than zero and output the full pathname, path plus filename, and the 256 characters we grabbed from the string. Finally we delete the file, using -force to make sure a read-only file is removed as well.
The second pathway is for xlsx files, a zip file of xml files that is the new format of Excel files for version 2007 onward. For this type of Excel file, we copy to temporary folder renaming to .zip and then create a COM Shell object to unzip the file into the temporary folder. Once the files are unzipped we then go into the \xl subdirectory and look for the connections.xml file which holds the connection information database, but also Access, Excel and Text for those thinking I wonder if I can find these connections as well. We then read the xml file into memory as an xml object so that we can operate on it with dot path notation. We search for the XML dbPr node and the connection attribute containing "Provider=SQL" if it exists and is not empty. If we find the attribute is not empty we spit the full filename out followed by the attribute value. Once again we have to delete the files and folder structure from our temporary folder, but we have our information and that is what we want.
We now have a way to search inside Excel files for database connection strings and with our output we can target the files that we need not having to open every Excel file to see if there is a database connection. Not perfect as pointed out earlier, but well worth running to save ourselves a lot of time. We do need to make sure that we have rights/permissions to the file system we run against, we can use Explorer to create drive pointers for the folder structure and create the local temporary folders (the script will run, but return nothing if we do not create these beforehand). We also need to have enough memory on the machine we run on to store the results of the directory search in memory, it can grow large if we have tens of thousands of files returned and we are searching millions of files in tens of thousands of directories on a path, so PowerShell needs memory to do this. The last thing to note, this will run on PowerShell version 2, but you really should upgrade to version 3 or higher there is just so much functionality in these versions and as usual there is no special add-ins or modules needed, just PowerShell itself.
PowerShell Code to Find and Search xls and xlsx files
Let's start with the whole PS script and then break down the different areas that we have inside it.
# http://subjunctive.wordpress.com/2008/04/01/powershell-file-search/ # http://stackoverflow.com/questions/8677628/recursive-file-search-using-powershell # http://blogs.technet.com/b/heyscriptingguy/archive/2013/06/24/use-powershell-and-regular-expressions-to-search-binary-data.aspx # http://social.msdn.microsoft.com/Forums/vstudio/en-US/77b0346d-37b3-4840-9df5-e1b74167aac9/byte-array-string-encoding?forum=clr # turn off errors to screen, there will some with regard to path length, file system permissions or out of memory, etc. $ErrorActionPreference = 'SilentlyContinue' # setup search path # search the path putting results in collection in variable $results = Try {dir -path Y:\ -recurse -include (‘*.xls’, ‘*.xlsx’) | Select-Object Fullname,Name,Extension,BaseName } Catch [system.exception] { [system.exception].toString() } # Print number of files found $results.Count # loop through the collection and place results in directory depending on the extension for separate processing later foreach ( $row in $results){ # if xls file process one way otherwise use the other method if ($row.Extension -eq '.xls') { # create path to copy file to temporary directory, which needs to exist $dest1 = 'C:\t5\'+$row.BaseName.ToString()+'.xls' Copy-Item -Path $row.Fullname.ToString() -Destination $dest1 # Get file from temporary directory $Stream = New-Object IO.FileStream -ArgumentList (Resolve-Path $dest1), 'Open', 'Read' # Note: Codepage 28591 returns a 1-to-1 char to byte mapping $Encoding = [Text.Encoding]::GetEncoding(28591) $StreamReader = New-Object IO.StreamReader -ArgumentList $Stream, $Encoding $BinaryText = $StreamReader.ReadToEnd() $StreamReader.Close() $Stream.Close() $ind1 = $BinaryText.IndexOf('Provider=SQL') # $ind1 if ( $ind1 -gt 0) { $row.Fullname.ToString() $BinaryText.Substring($ind1, 256) } Remove-Item C:\t5\* -recurse -force } else { $dest2 = 'C:\t4\'+$row.BaseName.ToString()+'.zip' Copy-Item -Path $row.Fullname.ToString() -Destination $dest2 $shell = new-object -com shell.application $zip = $shell.NameSpace($dest2) foreach($item in $zip.items()) { $dest3 = 'C:\t4' $shell.Namespace($dest3).copyhere($item) } $dest4 = 'C:\t4\xl\connections.xml' if (Test-Path $dest4) { [xml]$xml = Get-Content $dest4 if ($xml.connections.connection.dbPr.connection -ne $null -and $xml.connections.connection.dbPr.connection.Contains('Provider=SQL') ) { $row.Fullname.ToString() $xml.connections.connection.dbPr.connection } } Remove-Item C:\t4\* -recurse -force } }
Let's take a look at the first part before the loop. It has the links to sites I used and the ErrorActionPreference that turns the pesky red screen errors off. You can comment this line out to see any errors kicked out of the script most of which will be in the search portion right below it. The $results object holds the results of the Dir (alias to Get-ChildItem) and will kick out errors on several conditions I described earlier. I do wrap it in a try/catch code block, but it would really only catch the last error and not the intermediate errors. I also output the number of rows in the $results collection object just so you have an idea of how many files you have and do not have to open one by one.
# http://subjunctive.wordpress.com/2008/04/01/powershell-file-search/ # http://stackoverflow.com/questions/8677628/recursive-file-search-using-powershell # http://blogs.technet.com/b/heyscriptingguy/archive/2013/06/24/use-powershell-and-regular-expressions-to-search-binary-data.aspx # http://social.msdn.microsoft.com/Forums/vstudio/en-US/77b0346d-37b3-4840-9df5-e1b74167aac9/byte-array-string-encoding?forum=clr # turn off errors to screen, there will some with regard to path length, file system permissions or out of memory, etc. $ErrorActionPreference = 'SilentlyContinue' # setup search path # search the path putting results in collection in variable $results = Try {dir -path Y:\ -recurse -include (‘*.xls’, ‘*.xlsx’) | Select-Object Fullname,Name,Extension,BaseName } Catch [system.exception] { [system.exception].toString() } # Print number of files found $results.Count
The next part is the loop itself, but it really consists of the two pathways one for xls and the other for xlsx files. So I am going to look at the two paths and trust you understand the looping syntax. The first path is for xls files, we use the extension from the results object for this row as a condition. If it qualifies we create a path to copy the file to using the basename and string for temporary folder then copy it. Next we use the FileStream .NET class to create an object to open the file for reading. The encoding part is where the magic happens, we change from binary to string encoding and use a .NET StreamReader to read the file from disk into memory. Once in memory the IO stream is closed and we search the string for first part of the provider string for a SQL connection getting the index into the string of the first character. The 256 characters from the index position onward are taken out using a .NET substring method if they exist and output to the screen with the full pathname, path plus the filename.
# if xls file process one way otherwise use the other method if ($row.Extension -eq '.xls') { # create path to copy file to temporary directory, which needs to exist $dest1 = 'C:\t5\'+$row.BaseName.ToString()+'.xls' Copy-Item -Path $row.Fullname.ToString() -Destination $dest1 # Get file from temporary directory $Stream = New-Object IO.FileStream -ArgumentList (Resolve-Path $dest1), 'Open', 'Read' # Note: Codepage 28591 returns a 1-to-1 char to byte mapping $Encoding = [Text.Encoding]::GetEncoding(28591) $StreamReader = New-Object IO.StreamReader -ArgumentList $Stream, $Encoding $BinaryText = $StreamReader.ReadToEnd() $StreamReader.Close() $Stream.Close() $ind1 = $BinaryText.IndexOf('Provider=SQL') # $ind1 if ( $ind1 -gt 0) { $row.Fullname.ToString() $BinaryText.Substring($ind1, 256) } Remove-Item C:\t5\* -recurse -force }
The xlsx section below shows a similar method of processing, we copy and change the extension of the file to .zip and then use PowerShell COM object to unzip the file into the temporary folder with the file. Next we read the connections.xml file into an xml variable in memory. We check whether there is a connections.xml file and if it has a dbPr node with a SQL provider section. If these two conditions are met then we output the fullname of the file and the provider information from the dbPr node connection attribute. We finish by deleting the files and folders in the temporary directory and it continues looping to the next row in $results.
else { $dest2 = 'C:\t4\'+$row.BaseName.ToString()+'.zip' Copy-Item -Path $row.Fullname.ToString() -Destination $dest2 $shell = new-object -com shell.application $zip = $shell.NameSpace($dest2) foreach($item in $zip.items()) { $dest3 = 'C:\t4' $shell.Namespace($dest3).copyhere($item) } $dest4 = 'C:\t4\xl\connections.xml' if (Test-Path $dest4) { [xml]$xml = Get-Content $dest4 if ($xml.connections.connection.dbPr.connection -ne $null -and $xml.connections.connection.dbPr.connection.Contains('Provider=SQL') ) { $row.Fullname.ToString() $xml.connections.connection.dbPr.connection } } Remove-Item C:\t4\* -recurse -force
Below is a screenshot of the process running, unzipping the files to the temporary folder for xlsx files. Make sure and create the temporary folders on the local drive or you will need to run the script again.
We can get errors in these copies if there are corrupt elements or different code pages were used on the workstation creating the file than on the one opening it in the script processing. In the graphic below the date is 12/29/1899 which is not correct, but you can click Skip and the processing will continue on. You may consider checking up on the script periodically or assigning someone to watch it in a remote session window, if not it will stop and wait for you to click Skip. There is no simple way around this, but you are getting an automated process to find and search tens of thousands of files.
The end results would look similar to the image below where we have the filename as full path name to file and then the data connection information. The "Data Source=" shows the SQL Server the connection is using.
Next Steps
We can modify the PS script to write to a file or even a SQL table, I am happy with the screen dump, most files do not have a connection, but you may have a different situation. We can also modify the search condition. In xlsx files to find Text file connections we would look for textPr node and the "sourceFile=" attribute. For Access DB we would have dbPr node, but the provider becomes "Provider=Microsoft.ACE.OLEDB" in the connection attribute (version specific for the Excel version). There are similar connection attributes for Excel files and ODBC connections; it really depends on what the node type is and the connection attribute values.
If we look at a connections.xml file of an xlsx format Excel file with multiple types of data sources you can pick out the nodes and the attributes you would have to search for.
For the binary xls format of Excel it is much the same, but more difficult to find what you are looking for in the file. The two Excel files we are looking at are the same, one being the new xlsx format and the other the old binary xls format.
To better understand the connections used we can open the WorkBook Connections dialogue and see the connection names being used.
The issue comes in finding Text and Xml connections in the binary xls format it is not clear in the file where the information is found in the provider section. In the images below we see the Text and Xml connection information, but finding it in the xls file is not easy, in fact the Xml connection may not be in the provider section at all. It may be embedded with the worksheet information for the worksheet it is used in as I found in my Excel file.
About the author
This author pledges the content of this article is based on professional experience and not AI generated.
View all my tips