Consider New Ideas for SQL Server Data Analysis

By: Tim Smith | Comments | Related: > PowerShell

Problem

One of the most popular questions I've been asked relates to how we look at data. Most are derivatives of the question, "What are some other ways I can look at these data," whether that's in the form of other dimensions, correlations, comparable data sets, etc. One of the strongest investments anyone can ever make is thoroughly investigating the answer to this because you may identify invaluable patterns that haven't been recognized.

Solution

The three most useful approaches to thinking about how we look at data, like thinking of other dimensions, involve functions which measure a de-normalized data set, finding experts in the specific field, and understanding the assumptions behind data.

Identifying Category and Range Functions

When looking at a de-normalized table, we can run the below script against it and evaluate the distinct values. One function loops through the columns of a table on the dbo schema, and obtains the distinct count of the values in each of the columns, the other simply returns the output provided that the count is less than or equal to what we set as the limit. This serves a few purposes for thinking about other dimensions. The first is that we quickly learn which column of data has fewer values than others, especially if we're looking for a narrow limit, like 100. If we consider that only 50 states exist in the United States, any data set involving the United States should evaluate each state. Also, when we identify these limited columns, we can possibly break them down into smaller values (like county or zip code, in this case) or larger values (like region, in this case). Initially, this function identifies what else we could look evaluate. Contrast that with a column that has sixteen billion distinct rows; we may not build a dimension for that column.

Function Read-Output {
    Param (
    [string]$server
    , [string]$database
    , [string]$readquery
    )
    Process
    {
        $scon = New-Object System.Data.SqlClient.SqlConnection
        $scon.ConnectionString = "Data Source=$server;Initial Catalog=$database;Integrated Security=true;Connection Timeout=0"
 
        $cmd = New-Object System.Data.SqlClient.SqlCommand
        $cmd.Connection = $scon
        $cmd.CommandText = $readquery
        $cmd.CommandTimeout = 0
   
        $scon.Open()
        $sqlread = $cmd.ExecuteReader()
        
        while ($sqlread.Read())
        {
            $returnvalue = $sqlread.GetValue($value)
        }

        $scon.Dispose()
        return $returnvalue
    }
}


Function Get-Columns {
    Param (
    [string]$server
    , [string]$database
    , [string]$table
    , [int]$limit
    )
    Process
    {
        Add-Type -Path "C:\Program Files (x86)\Microsoft SQL Server\100\SDK\Assemblies\Microsoft.SqlServer.Smo.dll"

        $srv = New-Object Microsoft.SqlServer.Management.SMO.Server($server)

        foreach ($column in $srv.Databases["$database"].Tables["$table"].Columns | Where-Object {$_.Name -notlike "*id*"})
        {
            $colname = $column.Name
            $distinct_query = "SELECT COUNT(DISTINCT $colname) FROM $table"
            $distinct_values = Read-Output -server $server -database $database -readquery $distinct_query
            
            if ($distinct_values -le $limit)
            {
                Write-Host "Column, $column, has $distinct_values distinct values."
            }
        }
    }
}

Get-Columns -server "OURSERVER\OURINSTANCE" -database "OurDatabase" -table "DeNormalizedTable" -limit 250

For other dimensions, we can also consider ranges, which are typically used for bucket purposes, such as a range of 50-100g of protein consumed a day, or a range of 7-10 miles driven per commute. In the past, I've used algorithms which automatically create star schemas and take a guess at range values, but unfortunately, even upon adjusting them, they make some interesting assumptions and human oversight still matters. A good example of this in practice is the 200 simple moving average; a developer might think this addition could assist clients, but they might find it less meaningful. Likewise, the range of how much protein is consumed may be more important than the range in difference of carbohydrates to protein or vice versa from one researcher to another. The below queries simply returns the numeric columns, excluding ID columns, and the minimum and maximum of those values. Note that we could automate this and save it to a table, but it may still require oversight, so I prefer looking at each on a case-by-case basis. Whichever route you find appropriate, these can make good range values when adding new dimensions, or looking at data differently.

---- Individual table:
SELECT COLUMN_NAME ColumnName
 , NUMERIC_SCALE NumericScale
 , TABLE_NAME TableName
 , 'SELECT ''' + COLUMN_NAME + ''',  MIN(' + COLUMN_NAME + ') MinValue , MAX(' + COLUMN_NAME + ') MaxValue FROM ' + TABLE_NAME AS Query
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 'DeNormalizedTable'
 AND DATA_TYPE IN ('smallint','int','bigint','decimal','numeric','float','real','money')
 AND COLUMN_NAME NOT LIKE '%id%'
 
---- Multiple tables:
SELECT COLUMN_NAME ColumnName
 , NUMERIC_SCALE NumericScale
 , TABLE_NAME TableName
 , 'SELECT ''' + COLUMN_NAME + ''',  MIN(' + COLUMN_NAME + ') MinValue , MAX(' + COLUMN_NAME + ') MaxValue FROM ' + TABLE_NAME AS Query
FROM INFORMATION_SCHEMA.COLUMNS
WHERE DATA_TYPE IN ('smallint','int','bigint','decimal','numeric','float','real','money')
 AND COLUMN_NAME NOT LIKE '%id%'

Let's look at an example of how we could use the above functions on a data set involving wealth differences in the United States. Below, I've linked to a few data sets (free) that you can look at, which have some dimensions organized for us. Let's think about the current dimensions we see, as well as some we don't see that might provide us with value before we come to conclusions.

Pew Blog Post: America's wealth gap between middle-income and upper-income families is widest on record

Wikipedia Chart Summarizing the Fed's Consumer Survey

As we can see from link one, the upper income wealth has grown faster than the middle income. If we look at the chart summary on Wikipedia (year 2013), we might immediately consider two dimensions if we had the full denormalized data with values such as wealth, income, birth year, education level, region, etc.: age and education level, so let's look at both on the chart. We can see that the median wealth for people aged 55-64 is $165,720 versus $10,460 for people younger than 35. Also we see that median wealth for those college educated is $218,720 versus $17,250 for people without a high school diploma. Consider that in both of these cases, a person can be at different points in their life (not in the same study at one specific time). 55 year old Dr. Bob Smith at one time was less than 35 and didn't have an M.D. or high school diploma, so he may have been a part of the lower wealth of an earlier study, even though later he was on one of highest levels of income on a later study. These values, age and education, would come from possibly looking at distinct values less than 150-200, and we may start to see a pattern, which we can see on the Wikipedia chart. Now, let's go back to the Pew post and ask, "What is the median age for the upper income household versus the median age for the middle income household as well as well as what is the difference in education levels of the two groups?" If we received a denormalized data set with wealth, income, birth year, education level, region, etc. in one table, by applying our above functions we could have identified both age and education as dimensions to look at the data, as both would be less than 200 (education is often be reported as a range, like level).

We could look at other dimensions as well, such as region in the United States, home owner versus renter, as well as identify variables that may not be evaluated, such as the professional experience level of the people (a good question to ask here, "What's missing?"). With a denormalized data set, relative to the limits we set, our functions would tell us what these values are and we could evaluate the impact on the data. For some clients, we may do everything from development to offering suggestions about data, so developing familiarity with reading data and cultivating ideas from looking at the data will provide you with many benefits with any set of data.

The Value of Experts In the Field

SQL Server offers a strong community, and it's a useful reminder that communities even outside development exist that can help us add value to our clients. If you work in a specific field, like finance, gaming, healthcare, etc., you can also meet others who are experts in these fields and learn valuable information from them, like you learn valuable information from Microsoft MVPs. Some examples to evaluate, possibly challenging what we consider an expert:

From reviewing the portfolios of Warren Buffett, Carl Icahn, and other large investors, few saw the collapse in oil prices coming, though the young economic analyst Jesse Colombo did. This was not the first time one of his economic predictions came to fruition.
For decades, some of the smartest health care academics have been advocating a low fat diet. Yet, we're seeing concessions that this may have been wrong. Who's right, what do the data show, and who are the experts? These are very difficult questions to answer.
Two of the members of Long Term Capital Management were Nobel Prize winners.

Experts demarcate themselves from others by obtaining results in their field. Development and mathematics' skills will move you far in your career as will investing time with people who are at the top of their field. SQL Server offers a strong community and depending on the sector of your work, you can find other communities that will feature those in your profession as well. For instance, if you work in healthcare, find healthcare groups where people meet and invest time with them; this makes you more valuable to your profession and others in your profession in addition to your employer (or clients).

Know the Assumptions Behind Data

Finally, when we look at data, we should be careful about the limits of data and measurements and whether they are even useful in certain situations. One favorite idiom of mine, called Goodhart's Law, is "When a measure becomes a target, it ceases to be a good measure" and is one example of the reminder that behind every number is an assumption, and if the assumption is invalid, the number is irrelevant. This isn't a caution to abandon data; quite the opposite, know what assumptions lie beneath your data and be able to elucidate them and their impact. Consider the example of the statement "high protein consumption can lead to kidney failure" - as data developers, we may look at protein consumption to evaluate the effects. We may also theorize that inflammation is a factor, and look further into it. Or, we may theorize that genetics is more likely to be the basis of kidney failure and look at it. In all of these cases, the meaning involves the assumptions; if all of the assumptions are wrong, and kidney failure is caused by something we haven't discovered, the data may we have might be irrelevant due to the possibility that the data are coincidental.

Next Steps

The above functions offer the most utility on a de-normalized data set, which make them useful after importing data.
Depending on the problem you're solving, where would experts meet and discuss ideas?
In what way are you not looking at your data set?

About the author

Tim Smith works as a DBA and developer and also teaches Automating ETL on Udemy.

This author pledges the content of this article is based on professional experience and not AI generated.

View all my tips