Learn more about SQL Server tools

mssqltips logo
 

Tutorials          DBA          Dev          BI          Career          Categories          Webcasts          Scripts          Today's Tip          Join

Tutorials      DBA      Dev      BI      Categories      Webcasts

DBA    Dev    BI    Categories

 

PowerShell and Text Mining Part I: Word Counts, Positions and Libraries


By:   |   Last Updated: 2014-10-16   |   Comments   |   Related Tips: More > T-SQL

Problem

We've been evaluating text mining and are currently considering different approaches on how we can import and analyze the SQL Server data. As an organization familiar with numerical data, we have little experience with the basics of text mining for SQL Server and are curious about some of the basics and basic applications of text mining.

Solution

One of the most popular questions I receive relates to text mining, because we've seen an increase in "social" data (for example Amazon reviews), and some organizations see this as an opportunity to make various predictions. In this tip I will cover simple examples and approaches for organizations unfamiliar with text mining and automation, though we can take these same approaches and add complexity for advanced analysis. Like the trouble of having too much numerical data, we should also be careful with textual data because we can find plenty of noise, or over-analyze issues to a point where we no longer are acting on our findings.

I'll cover three simple approaches to text mining with PowerShell - word counts, positions and the use of a third party library tool.  I'll focus on an effective approach using PowerShell and data from text files.

If we read web pages, social media updates or other items we can save that data to files for processing or alter the below process to read directly from the web pages.  For instance here is a process to read tweets:

$all_tweets = ""
foreach ($tweet in $tweets)
{
 $all_tweets += "  " + $tweet
}

The above builds a long string called $all_tweets that we can mine for word counts and positions, by looping through an array (or list) of tweets. We can also do the same with blog posts, social media updates, web page posts, etc., as the logic would not need to change.

Processing Data and Getting Word Counts

In this word count example, we will:

  1. Read the text from a text document.
  2. Name our table, which will store the data from the text document.
  3. Store results from the file that we've read in our table.

First, we will build an empty hash table and read from a file using Get-Content and applying some basic RegEx to strip unimportant characters, such as commas, parenthesis, periods, quotes, etc. because we're looking at words and their counts, not the context and punctuation. If we were to skip the RegEx part, we need to know that "done" (let's say appears 7 times) and "done." (appears 2 times) would be possible, when what we're ultimately looking for is the count of "done" which should be 9.

The below code strips these characters from the words.

$nl = [Environment]::NewLine
$wrdc_mining = @{}
 
$agg_snt = Get-Content "C:\files\Other\fedstatement.txt"
$agg_snt = $agg_snt.ToLower() -replace "[.,!?)(]",""

Of course, there is one catch - the apostrophe. As TSQL developers know, the apostrophe escapes strings, for instance:

INSERT INTO OurTable VALUES ('The quick brown fox jumped over the lazy dogs.')
INSERT INTO OurTable VALUES ('Why did you steal the dog's bone?')

The first insert will work, but the second INSERT will fail because "dog's" has an apostrophe and it escapes the string, thus SQL Server spits out, Incorrect syntax near 's' because we failed to handle the apostrophe. So in order to handle this one apostrophe becomes two as shown below.

INSERT INTO OurTable VALUES ('Why did you steal the dog''s bone?')

What does this mean when we mine a file? Consider that its and it's don't mean the same thing. We could just strip all the apostrophes, and if we don't care about some key exceptions, this might be less of a headache because most of us understand that wont was really won't. But since words like its and it's exist, we can apply some RegEx principles to handle these, as these words fundamentally involve an alphabetic character (or two) and an apostrophe, and we only want to add an apostrophe in these instances; all other instances, we want to strip apostrophes.  Here is the code to handle keeping some apostrophes and removing ones we do not want.

$words = "'Twas the night before Santa's party, when the elves' tools ' malfunctioned in the shop."
$words = $words.ToLower() -replace "[.,!?)(]",""
$words = $words -replace "([^a-z])'([^a-z])","`${1}`${2}"
$words = $words -replace "'","''"

Outside of incredibly bizarre English (always a possibility), this will keep our valid apostrophe words and help us demarcate between words like its and it's, while removing apostrophes that don't add value, or are randomly placed in sentences.

Once we remove all unnecessary characters, we will then break the long string apart by spaces, store each word with a count, and check if the word already exists in the hash table. If the word exists, we will increment the count, remove the word, and store the word again with the new incremented count. We do this step because otherwise the hash will keep adding the value as new with 1, which won't be accurate and defeats the purpose of counting words. We could also remove certain words (or prevent them from being added) in this step, though I caution developers from doing this because some words that people perceive as being "filler" words can actually tell you a lot about the context when you're performing analysis. Like with raw data, I prefer to keep the raw data, then apply filters.  Here is the code to get a count of each word.

foreach ($iwc in $agg_snt.Split(" "))
{
 if (($wrdc_mining[$iwc]) -ne $null)
 {
  $cnt = $wrdc_mining[$iwc] + 1
  $wrdc_mining.Remove($iwc)
  $wrdc_mining.Add($iwc, $cnt)
 }
 else
 {
  $wrdc_mining.Add($iwc, 1)
 }
}

Now that our hash table is built with words and their counts, we want to insert the data into SQL Server. Since a hash table operates with a key-value store, we can grab each of the keys and their values, in this case the keys will be the words and the values will be the count of the words. Each of these (the keys and values) are added to a T-SQL insert statement. Since I wrap this in a function where the table is specified, I use an object called $table to create the table and insert the data. We'll check if the table exists before continuing, create it if it doesn't. I prefer to avoid keeping my data in the same table - for instance, News would be in its own table and Federal Reserve data would be in its own table. When mining the text of a new data source, we will want to create a table with the word, word count and date, whereas an existing data source will use the existing table.

$sql_insert = "IF OBJECT_ID('$table') IS NULL BEGIN CREATE TABLE $table (Word VARCHAR(250), WordCount INT NULL, WordDate DATE DEFAULT GETDATE()) END" + $nl

foreach ($k in $wrdc_mining.Keys)
{
 $sql_insert += "INSERT INTO $table (Word, WordCount) VALUES ('$k'," + $wrdc_mining[$k] + ")" + $nl
}

In our first example, we will have two separate tables and this example tends to make sense in data contexts where the analysis on text data tends to involve more than just how the data have changed from previous time periods. For instance, from a data analysis perspective, what isn't mentioned is often as interesting as what is mentioned and this would be an example of analysis unconcerned with changes in time. In our second approach, we store the data by time, where our table has a default value of GET_DATE() when the data are imported on the column WordDate. In Part II, I'll show some examples of some T-SQL we can use to look at the change in word frequency over time. Which ever route we choose, we will want to then insert the data into our table.  Here is the code that allow us to connect to SQL Server and insert the data.

$scon = New-Object System.Data.SqlClient.SqlConnection
$scon.ConnectionString = "SERVER=" + $server + ";DATABASE=" + $db + ";Integrated Security=true"

$adddata = New-Object System.Data.SqlClient.SqlCommand
$adddata.Connection = $scon
$adddata.CommandText = $sql_insert

$scon.Open()
$adddata.ExecuteNonQuery()
$scon.Close()
$scon.Dispose()

Complete PowerShell Function

After adding the final piece above, we can create a function to call for repetition:

Function TextMining_WordCounts($file, $table, $server, $db)
{

    $nl = [Environment]::NewLine
    $wrdc_mining = @{}
 
    $agg_snt = Get-Content $file
    $agg_snt = $agg_snt.ToLower() -replace "[.,!?)(]",""
    $agg_snt = $agg_snt -replace "([^a-z])'([^a-z])","`${1}`${2}"
    $agg_snt = $agg_snt -replace "'","''"

    foreach ($iwc in $agg_snt.Split(" "))
    {
     if (($wrdc_mining[$iwc]) -ne $null)
     {
      $cnt = $wrdc_mining[$iwc] + 1
      $wrdc_mining.Remove($iwc)
      $wrdc_mining.Add($iwc, $cnt)
     }
     else
     {
      $wrdc_mining.Add($iwc, 1)
     }
    }

    $sql_insert = "IF OBJECT_ID('$table') IS NULL BEGIN CREATE TABLE $table (Word VARCHAR(250), WordCount INT NULL, WordDate DATE DEFAULT GETDATE()) END" + $nl

    foreach ($k in $wrdc_mining.Keys)
    {
        $sql_insert += "INSERT INTO $table (Word, WordCount) VALUES ('$k'," + $wrdc_mining[$k] + ")" + $nl
    }

    $scon = New-Object System.Data.SqlClient.SqlConnection
    $scon.ConnectionString = "SERVER=" + $server + ";DATABASE=" + $db + ";Integrated Security=true"

    $adddata = New-Object System.Data.SqlClient.SqlCommand
    $adddata.Connection = $scon
    $adddata.CommandText = $sql_insert

    $scon.Open()
    $adddata.ExecuteNonQuery()
    $scon.Close()
    $scon.Dispose()
}

TextMining_WordCounts -file "C:\files\OurFile.txt" -table "OurTable" -server "OURSERVER\OURINSTANCE" -db "OURDATABASE"

Now that we have our function, we mine files for words, get counts, and log the data. For instance, suppose that we received a text file called "DailyNews.txt" which came from various news' sites, we could call our function as follows:

TextMining_WordCounts -file "C:\files\DailyNews.txt" -table "DailyNews" -server "OURSERVER\OURINSTANCE" -db "OURDATABASE"

Each day, it would perform the word counts and store the results in the table with the dates logged. We could also scrape a specific news site, store results, and evaluate if we want to continue using the source of news for information, because we can now track (using word counts) exactly what the site thinks are important topics. We could also avoid using some news sources in particular if we want to avoid certain types of news. A few people I've spoken with only want objective news and use word counts to determine how many emotional verbs a site uses and avoid using the site if it crosses a certain total words to emotional words ratio. These simple applications of counting words show that mining words and counts can offer us (and clients) some tools, and we can adjust them depending on what we need to achieve.

Word Positions

Similar to the word count approach, we can also look at word position. The below function is similar to the above function in how it reads the data and differs with what it inserts:

Function MineFile_WordPosition ($file, $srv, $db)
{
    $nl = [Environment]::NewLine
    $sn = Get-Content $file
    $sn = $sn -replace "'","''"
    $sn = $sn -replace "[,;:]",""
 
    $cnt = 1
    $add = "IF OBJECT_ID('Word_Stage') IS NOT NULL BEGIN DROP TABLE Word_Stage END CREATE TABLE Word_Stage(Word VARCHAR(250), WordPosition INT, StatementDate DATETIME DEFAULT GETDATE())"
    foreach ($s in $sn.Split(" "))
    {
        if ($s -ne "")
        {
            $add += $nl + "INSERT INTO Word_Stage (Word,WordPosition) VALUES ('$s',$cnt)"
            $cnt++
        }
    }
 
    $scon = New-Object System.Data.SqlClient.SqlConnection
    $scon.ConnectionString = "SERVER=$srv;DATABASE=$db;integrated security=true"
 
    $cmd = New-Object System.Data.SqlClient.SqlCommand
    $cmd.CommandText = $add
    $cmd.Connection = $scon
 
    $scon.Open()
    $cmd.ExecuteNonQuery()
    $scon.Close()
    $scon.Dispose()
}

MineFile_WordPosition -file "C:\files\Other\fedstatement.txt" -srv "TIMOTHY\SQLEXPRESS" -db "MSSQLTips"

This function mines a file by counting the position of the word. For instance, in the sentence "The quick brown fox jumped over the lazy dogs" the word "the" is in position one and seven. This function can become very tricky: people of various demographics do not (1) use similar language structure, (2) follow grammatical rules, and (3) may create problems for text miners. For this reason, I am very careful about applying strict RegEx, as I may miss the context of the statement (or word).

For instance, this., this!, and this? are not the same in my analysis; neither is a lot and a A LOT in the following paraphrased statement from a young person summarizing a quote from Keynes: "When you owe the bank a lot of money, you're in trouble. When you owe the bank A LOT of money, they're in trouble." If we're too strict in our application, we may begin to lose the emphasis and context, considering that groups differ in how they place it (not everyone uses bold or italics). I write this to caution developers on applying overly-strict solutions, depending on what you're wanting to achieve - it may make sense in certain cases, and in other cases not make sense. The fundamental key here is that you know why you're mining the way you're choosing to do it; this concern is also why I tend to avoid third party tools - I find their assumptions invalid; again, in your case, these tools may make sense.

This function inserts the data into a staging table and from here, we'll execute a stored procedure to relationally map our data:

CREATE TABLE Word_ByPosition (
 StatementDate DATETIME,
 WordID BIGINT,
 WordPosition INT
)

CREATE TABLE Words (
 WordID BIGINT IDENTITY(1,1),
 Word VARCHAR(250) NULL
)

CREATE PROCEDURE stp_AddWordPositions
AS
BEGIN

 INSERT INTO Words (Word)
 SELECT DISTINCT Word
 FROM Word_Stage
 WHERE Word NOT IN (SELECT Word FROM Words)
  
  
 INSERT INTO Word_ByPosition (StatementDate, WordID, WordPosition)
 SELECT w.StatementDate
     , ww.WordID
     , w.WordPosition
 FROM Word_Stage w
     INNER JOIN Words ww ON w.Word = ww.Word


END

In the above procedure, we add words that do not exist to our Words table, insert into our Words_ByPosition table the statement date, word ID, and the position of the word. Text miners have a tendency to mine data for sentiment (how effective is it? VERY EFFECTIVE - is that sarcasm or not?), and the above solution with position is one way to do so. We can also look at other patterns with this analysis too - what words are used in the first one hundred words and how does this change over time.

Using the above, we now have some simple tools at our disposal. We can test theories and add these tools to larger solutions; for instance, imagine an algorithm that reads error log data, reads from a construction of solutions, applies each to the problem until it is eliminated. At the basic and fundamental level (just like with our brains), it must be able to read each word individually and understand the context of the word as a part of the entire error.

Third Party Library Tool

Another approach we can use to text mining is using a tool provided by an open source library, WordNet SQL Builder, which comes with its own how-to and query page (the latter being helpful to understand how to use it) and some examples for developers. For this third party library, I'll cover how to install this to SQL Server, as you may notice that the downloads pertain mostly to other SQL engines, like SQLite, PostgreSQL, etc., which we can use when using those engines. This third party tool is fairly comprehensive and you should definitely consider using it if you want to dig into text mining and go deep; for beginners, I'd recommend playing with the above approaches to build familiarity with what you can do.

To obtain the third party tool for SQL Server, Troy Simpson provides us with the MDF and LDF files at this link in a zip file. To add this database to SQL Server:

  1. Unzip the MDF and LDF files.
  2. In SQL Server, right click on the folder databases.
  3. Select the option, "Attach..."
  4. Click "Add" and point to the MDF and LDF file.
  5. Double check the names of the data file, log and database to match your naming convention.

You can also run the below script, though this assumes the data and log file are on different drives (D for data and F for log), so you'll want to adjust this script to point to the right place if your data and log files are on different drives:

USE [master]
GO

CREATE DATABASE [wordnet] ON 
( FILENAME = N'D:\Data\wordnet.mdf' ),
( FILENAME = N'F:\Log\wordnet_log.ldf' )
 FOR ATTACH
GO

The database is compatible with SQL Server 2005 and is approximately 200 MB in size; I've tested with SQL Server 2008R2 with no issue. In addition, it comes with tables that have slightly different names than the ones found in the reference from the how-to and query page, though a lot of the same logic can be deduced by looking at examples and playing with queries. For starters, I'd suggest the below query to get a basic feel of the tool and consider where it might be useful, or consider what pieces of this you could also build for your own text mining tool:

SELECT *
FROM Word w
 INNER JOIN sense s ON w.wordid = s.wordid
 INNER JOIN synset sy ON s.synsetid = sy.synsetid
 INNER JOIN categorydef c ON c.categoryid = sy.categoryid AND c.pos = sy.pos

We've covered some basics in text mining, such as obtaining word counts and positions, as well as imported and advanced third party tool. In part two, we'll look at some applications using the above approaches to text mining.

Next Steps
  • Import data from text files.
  • Test what other text data should be removed; for instance "<" and ">" characters?
  • Copy something you've written to a text file and import it. What words do you expect that you use a lot and what were your results?


Last Updated: 2014-10-16


get scripts

next tip button



About the author
MSSQLTips author Tim Smith Tim Smith works as a DBA and developer and also teaches Automating ETL on Udemy.

View all my tips




Post a comment or let the author know this tip helped.

All comments are reviewed, so stay on subject or we may delete your comment. Note: your email address is not published. Required fields are marked with an asterisk (*).

*Name    *Email    Email me updates 


Signup for our newsletter
 I agree by submitting my data to receive communications, account updates and/or special offers about SQL Server from MSSQLTips and/or its Sponsors. I have read the privacy statement and understand I may unsubscribe at any time.



    



Learn more about SQL Server tools