Using Python and Regex for SQL Server ETL Needs with Non-Standard Text Files



Problem

SQL Server Integration Services (SSIS) is a remarkably efficient and powerful tool for importing data into SQL Server, but there are times when it is more convenient to use Python to handle non-standard text files due to familiarity with Python or compatibility with preexisting code. How can Python be used to handle ETL tasks for SQL Server with non-standard text files?

Solution

Why use Python for ETL?

SQL Server Integration Services (SSIS) ships with SQL Server and is an effective and efficient tool for most Extract, Transform, Load (ETL) operations. It has proven itself versatile and easy to use. Because of its Script Component, SSIS can also handle non-standard text file formats and even perform more complicated mathematical processing before inserting the data into SQL Server. It is rightly a tool that many DBAs and others who work with SQL Server rely on in their daily workflow.

While SSIS is a fantastic ETL tool, there are times when it makes sense to use something else. If you are integrating the ETL task into a larger program, it may be better to write the ETL portion in the project's primary language. If you are doing complex transformations before inserting the data, you may already have custom functions, or even entire libraries of functions, that will be useful, and so you may want to do the ETL in the language those libraries are written in. There may also be times you need to do an ETL task without having SSIS available, especially on a small SQL Server Express installation that may not include all of the management tools. Additionally, I have started projects intending them to be single user, built them on SQLite, and then migrated to the more powerful SQL Server as the demands on the project grew; when using Python, that migration requires only minor changes.
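As a rough sketch of how small that change can be (the database file, server, database, and table names here are placeholders, not part of this article's examples): both the standard sqlite3 module and pyodbc follow the Python DB-API and accept ? placeholders, so often only the connection setup needs to differ.

import sqlite3
#import pyodbc #the SQL Server equivalent

#SQLite version: the database is just a local file
conn = sqlite3.connect('project.db')

#SQL Server version: only the connection line changes; the cursor,
#execute calls, and ? parameter style all stay the same
#conn = pyodbc.connect('DRIVER={SQL Server Native Client 11.0};'
#                      'Server=YourServer;Database=Test;Trusted_Connection=YES')

curs = conn.cursor()
curs.execute("CREATE TABLE IF NOT EXISTS People (FName TEXT, LName TEXT)")
curs.execute("INSERT INTO People (FName, LName) VALUES (?, ?)",
             ('Ada', 'Lovelace'))
conn.commit()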

As a popular, easy-to-use programming language that often requires little development time for simple scripts, Python is frequently a good choice for ETL tasks with SQL Server. In this tip, we will look specifically at using Python to import a file that has its headers inline with the data and to import a file without clear field demarcations.

Headers Inline

To start with, I will use an example file named example.txt that looks like this:

/FName Pythagoras /LName Samos /Phone 141-421-3562
/LName Einstein /FName Albert /phone 299-792-4580
/phone 161-803-3988 /lname Penrose /fname Roger
/lname Syracuse /fName Archimedes /phone 314-159-2653
/PhOnE 271-828-1828 /lNaMe Euler /fNaMe Leonhard
/lNAME Cyrene /fNAME Theodorus /phone 173-205-0807

This example deliberately varies both the order in which the fields appear and the capitalization of the field identifiers. We will also assume we have a fairly straightforward table to store the data:

CREATE TABLE People
	(FName VARCHAR(30) PRIMARY KEY,
	LName VARCHAR(30),
	Phone VARCHAR(12)
	)

We will start by providing the entire example script to do this and then go over the key portions of it.

import pyodbc #to connect to SQL Server
import re

sqlConnStr = ('DRIVER={SQL Server Native Client 11.0};Server=YourServer;Database=Test;'
              'Trusted_Connection=YES')

sqlConn = pyodbc.connect(sqlConnStr, autocommit=True)
curs = sqlConn.cursor()

#Set the Regex patterns
#(?i) tells it to ignore case
#(?<= ) is a lookbehind: it requires, but does not include, the preceding part
#[^/]+ then matches any characters until it comes to the end of the line or a /
fNameRe = re.compile(r'(?i)(?<=/FName )[^/]+')
lNameRe = re.compile(r'(?i)(?<=/LName )[^/]+')
#not making sure it is actually a phone number, just that it follows /phone
phoneRe = re.compile(r'(?i)(?<=/phone )[^/]+')

with open('example.txt') as inF:
    for line in inF:
        fName = fNameRe.findall(line)[0]
        fName = fName.strip() #remove surrounding spaces and line breaks
        lName = lNameRe.findall(line)[0]
        lName = lName.strip() 
        phone = phoneRe.findall(line)[0]
        phone = phone.strip()
        
        #Now generate the SQL to insert into the database
        #parameterize it both to encourage query plan reuse
        #and to protect against potential SQL Injection in the file
        sql = """INSERT INTO dbo.people (FName, LName, phone)
        values (?, ?, ?)"""
        
        curs.execute(sql, fName, lName, phone)

I tried to make this script short and to the point rather than complete. For that reason, I left out error trapping and any handling of lines in the file that may be malformed. I also tried to comment it fairly thoroughly, but a few things are worth a closer look.
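First, though, a minimal sketch of what such a guard might look like: findall returns an empty list when a field is missing from a line, so the loop body can check the result before indexing into it (shown for FName only; the print and continue are illustrative choices, not part of the original script).

        fNames = fNameRe.findall(line)
        if not fNames: #no /FName field on this line
            print('Skipping malformed line:', line.rstrip())
            continue
        fName = fNames[0].strip()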

The script makes use of two key libraries, pyodbc and re. Pyodbc is my preferred way to access SQL Server and some other ODBC data sources from Python, though it is hardly the only option. Re is Python's implementation of regular expressions, or regex. Regex is a powerful, though sometimes somewhat arcane, pattern-matching language. This script uses three regular expressions that all follow the same pattern and look like:

(?i)(?<=/FName )[^/]+

This can be broken down into several parts. The first, (?i), sets a flag for the regular expression engine telling it to ignore case for this pattern. The same thing could have been done by passing a flag to the compile command instead of embedding it in the expression. The next part, (?<=/FName ), is a lookbehind. It tells the regex engine to match only text that comes immediately after that expression, so /FName itself will not be returned, but the portion that follows it will be. Then [^/]+ does the main work. Bracketing a set of characters tells the engine to match any one of the characters inside the brackets, and starting the bracketed set with ^ inverts it to match any character except those listed. The + afterwards tells it to match one or more of those characters, so it consumes everything up to the next forward slash or the end of the line. This may all be clearer with some examples in the interactive Python shell.

>>> import re #enable REGEX in this session
>>> line1 = '/FName Pythagoras /LName Samos /Phone 141-421-3562'
>>> line2 = '/PhOnE 271-828-1828 /lNaMe Euler /fNaMe Leonhard'
>>> fNameRe = re.compile(r'(?i)(?<=/FName )[^/]+') #start with it just as it is in the script
>>> fNameRe.findall(line1)
['Pythagoras ']
>>> fNameRe.findall(line2) #location in the string doesn't matter, just the pattern
['Leonhard']
>>> caseSensitive = re.compile(r'(?<=/FName )[^/]+') #Take out the flag that makes it case insensitive
>>> caseSensitive.findall(line2)
[]
>>> NoPlus = re.compile(r'(?<=/FName )[^/]') #Remove the plus, it will only take one character
>>> NoPlus.findall(line1)
['P']
>>> 

The actual SQL is executed with curs.execute and is fully parameterized, which can in some cases help with execution plan reuse and provides some minimal protection against SQL injection or unintentional errors in the dataset. You may also notice that I set autocommit to true and send an insert statement to SQL Server every time a row is processed. Depending somewhat on the data set, it may be preferable to store the values in a list of tuples or a similar data structure and then use executemany with a separate call to commit the data. In this sample script, though, I emphasized readability and reliability over efficiency.
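A rough sketch of that batched alternative, assuming the same compiled patterns and connection string as the script above, might look like this (recent versions of pyodbc also offer a fast_executemany cursor attribute that can speed up large batches):

sqlConn = pyodbc.connect(sqlConnStr, autocommit=False) #commit explicitly instead
curs = sqlConn.cursor()

rows = []
with open('example.txt') as inF:
    for line in inF:
        rows.append((fNameRe.findall(line)[0].strip(),
                     lNameRe.findall(line)[0].strip(),
                     phoneRe.findall(line)[0].strip()))

sql = """INSERT INTO dbo.People (FName, LName, Phone)
values (?, ?, ?)"""

curs.executemany(sql, rows) #one parameterized statement for the whole batch
sqlConn.commit() #a single explicit commit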

No Clear Demarcations

Regex is also quite effective where there are no clear value separators, provided the data can be identified by its own structure. For instance, we may have a list in companies.txt of company names, each with a contact e-mail address and phone number, that looks like this:

Made Up Company contact@madeup.example.com 123-456-8910
Lightspeed Inc. info@lightspeed.example.com 299-792-4580
Gravitational Constant LLP info@gravitational.example.com 667-384-8000
Planck LLC info@planck.example.com 105-457-1726

There is no consistent delimiter, such as a comma or tab, to mark the end of one field and the beginning of the next. A space separates the fields, but spaces also appear within at least the company name field, so we cannot reliably split on them. Instead, we can identify each field by the properties of the data itself. First, we'll make a table to store the test data:

CREATE TABLE Companies
	(CompanyName VARCHAR(50),
	EmailAddress VARCHAR(50),
	Phone VARCHAR(12)
	)

Then we could import it with a script like:

import pyodbc #to connect to SQL Server
import re

sqlConnStr = ('DRIVER={SQL Server Native Client 11.0};Server=YourServer;Database=Test;'
              'Trusted_Connection=YES')

sqlConn = pyodbc.connect(sqlConnStr, autocommit=True)
curs = sqlConn.cursor()

#set the two patterns

#\d matches any digit; {3} and {4} tell it to find exactly 3 or 4 repetitions
#of the previous expression, and the dashes are matched literally.
phoneRe = re.compile(r'\d{3}-\d{3}-\d{4}')

#This time setting the flag outside the pattern just as an example.
emailRe = re.compile(r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}', re.I)

with open('companies.txt') as inF:
    for line in inF:
        email = emailRe.findall(line)[0]
        phone = phoneRe.findall(line)[0]

        #To get the company name, take the part of the line that comes
        #before the email, dropping the single space just ahead of it
        emailStart = emailRe.search(line).start()
        companyName = line[0:emailStart-1]
       
        sql = """INSERT INTO dbo.Companies (CompanyName, EmailAddress, Phone)
        values (?, ?, ?)"""
        
        curs.execute(sql, companyName, email, phone)

This is similar to our previous script in most ways. But here, rather than using lookbehind matches for the email and phone, we have the regex engine match the portions we want directly. The company name has no consistent pattern we can match, but its location is consistent, so we can use emailRe.search(line).start() to find where the e-mail address begins and then take a slice of the line from its first character to just before the e-mail address starts. In this way, we break out all of the portions of each line we want.
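Returning to the interactive shell, the slicing step works like this on the first line of the sample file:

>>> import re
>>> line = 'Made Up Company contact@madeup.example.com 123-456-8910'
>>> emailRe = re.compile(r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}', re.I)
>>> emailStart = emailRe.search(line).start()
>>> emailStart
16
>>> line[0:emailStart-1] #everything before the space that precedes the email
'Made Up Company'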

Conclusion

I have frequently found Python with regex to be highly useful in extracting data from text files that are structured in unusual ways. Python has also served me well in preprocessing data, such as standardizing units, before importing it into SQL Server. Although there are many tools for doing complex ETL, Python's flexibility and ease of use make it an excellent choice for certain types of ETL situations.
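As one small illustration of that kind of preprocessing, here is a hedged sketch (the mixed-unit weight field is hypothetical, not from this article's examples) that standardizes weights to kilograms before they would be inserted:

import re

#matches a number followed by a unit, e.g. '12 lb' or '5.4kg'
weightRe = re.compile(r'(?i)([\d.]+)\s*(kg|lb)')

def to_kg(text):
    value, unit = weightRe.search(text).groups()
    return float(value) * 0.453592 if unit.lower() == 'lb' else float(value)

print(to_kg('12 lb')) #5.443104
print(to_kg('5.4kg')) #5.4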


About the author
Tim Wiseman is a SQL Server DBA and Python Developer in Las Vegas.

This author pledges the content of this article is based on professional experience and not AI generated.




Comments For This Article




Thursday, March 31, 2016 - 12:09:34 PM - Zinc

My experience has been that SSIS doesn't play very nice with source control. Python would be a lot better there, I would think. Though I'd like to know how well it interfaces with more DB-oriented operations, rather than the REGEX/text file manipulation discussed here...


Friday, April 10, 2015 - 1:39:03 PM - jeff_yao

Yes, Python is a nice tool/language. I'd like to see more articles about using Python to do DBA work.


Friday, April 10, 2015 - 9:59:59 AM - Marco Dulog

+1 for Python, awesome tool for data work. Nice explanations on RegEx!

