Considering Derived Values In ETL Flows

By: Tim Smith | Comments | Related: More > Import and Export

Problem

We use the previous decade as our data template for the current decade to help us execute decisions. In our transformation ETL steps, we sometimes want to add derived values based on the previous decade's average to our final report where we're comparing the current value to the average of the set. Sometimes, we use these derived values for other estimates that don't end on our final reports. We may sometimes manipulate the current or average value too used in these derivatives and right now we do this inefficiently through spreadsheets. How can we efficiently do this in our transformation steps without adding to many extra steps or use too many resources?

Solution

For this solution, we'll look at adding the derived values in the final step along with adding a possible configurable value for our report. We'll assume that we don't want these derived values based on the average stored, though we will see an alternative where we can also add these for storage if our situation calls for this. We also want the base values to determine how many derived values exist, so that our base to derivative values is always 1-1.

In general, a derived value is correlated with a base value, such as a moving average or a person's weight in kilograms converted to pounds. The base is the starting value (or values) used to create the derived value in the form of a function. The amount of oil from a pump in a given day is not a derived value, whereas the average amount of oil for a week is a derived value from the previous 7 days. A person's weight in pounds (or kilograms) is not a derived value, but if we convert it to another measurement from its original measurement, it is a derived value. This means that anytime we're adding a derived value, we must consider the base value or values. For a set of base values, this means that we may have a derived value per base value, such as the average of the previous 7 days of oil extraction continuously tracked on the latest value (the original 6 values would not have this since there wouldn't be any previous 7 values). For this problem, we'll use a derived value comparing the current value to the average of the set, which means that all current values will also have a derived value associated with them.

For this example, we'll create a test data set using a loop that creates 144 values, or 1 value for each month in the past ten years with the date and a random value that we'll be using for deriving values based on aggregates. In the below loop, we use the DATEDIFF function to subtract the looping value from the current date of December 15, 2010. In addition, we're inserting this time value along with a random value into our example table. From this example data, we will create a CSV file that we will use for a bulk insert later.

CREATE TABLE eltInitial(
	RepDate DATE,
	RepValue DECIMAL(13,4)
)

DECLARE @round INT = 1, @valueMult DECIMAL(5,2)
WHILE @round <= 144
BEGIN
	SET @valueMult = (100.0*RAND())
	INSERT INTO eltInitial (RepDate,RepValue) VALUES (DATEADD(MONTH,@round,'1999-12-15'),@valueMult)
	SET @round = @round + 1
END

SELECT * 
FROM eltInitial

Because some systems will generate extra characters when we click Save As (as shown in the image), I will use a PowerShell import over BULK INSERT in this tip. In cases where numeric data needs extra effort to convert correctly, my general preference is to read the data using .NET as it's a simple object declaration over converting. This is only a preference and this may not be required. Some systems will not experience this problem when generating a CSV, so a BULK INSERT in the later step will function fine. We now have our test data set we'll use for this example - a source table of data (eltInitial) for one example and a data file for another example we'll import.

In the below code snippets we see two examples: the first is a query with the derived value added from the eltInitial table, which we might see as a final query in a report. This structure works when our derived value is a final value and our source is an existing table we can query. If this was only part of the process, we would be inserting these data into another table, and the same is true if we were using the derived values as part of a larger query. As long as their base values exist, we can add the derived values in the final step for the report.

--Q1
DECLARE @avg DECIMAL(7,4)
SELECT @avg = AVG(RepValue) FROM eltInitial
SELECT 
	RepDate
	, RepValue
	, (RepValue - @avg) ValAboveAvg
FROM eltInitial

It's possible that we want to use the derived values for something that won't be on a final report - so we want to add them early for use that will be important for a future query and the final report which will not have them. Using our test data that we created, this would be like a final report that simply shows the final average value along with the average of the total value above average values - in this case, we only need the derived values for further analysis. In this second approach, we see on popular import approach: a table being created, the data from the CSV file we created being imported, a column being added to the table with an update to create the ValAboveAvg and a final query ready for the next step in our ETL process. With a process like this, we may have a step for validating data prior to deriving a value, such as cleaning and checking values, prior to loading data elsewhere. If I was using BULK INSERT, I would clean the data first following the import; with the below PowerShell, I could add validation within the script import. Since we are looking at adding derived values in ETL flows, this is an important consideration outside the topic.

### PowerShell import
Function Execute-Sql {
    Param(
        [string]$command
    )
    Process
    {
        $scon = New-Object System.Data.SqlClient.SqlConnection
        $scon.ConnectionString = "Data Source=OurServer;Initial Catalog=OurDatabase;Integrated Security=true"
        
        $cmd = New-Object System.Data.SqlClient.SqlCommand
        $cmd.Connection = $scon
        $cmd.CommandTimeout = 0
        $cmd.CommandText = $command

        try
        {
            $scon.Open()
            $cmd.ExecuteNonQuery()
        }
        catch [Exception]
        {
            Write-Error $_.Exception.Message
        }
        finally
        {
            $scon.Dispose()
            $cmd.Dispose()
        }
    }
}

Function Read-IntoTable {
    Param(
        [string]$file
    )
    Process
    {
        $readfile = New-Object System.IO.StreamReader($file)
        $query = ""

        try {
        
            while ($line = $readfile.ReadLine())
            {
                $date = $line.Split(",")[0]
                [double]$value = $line.Split(",")[1]
                $query += "INSERT INTO etlImport VALUES ('$date',$value)
                "
            }
            Execute-Sql -command $query
        }
        catch [Exception]
        {
            Write-Error $_.Exception.Message
        }
        finally
        {
            $readfile.Dispose()
        }
    }
}

Read-IntoTable -file "C:\ETL\etlinitial.csv"

---TSQL:
CREATE TABLE etlImport(
	RepDate DATE,
	RepValue DECIMAL(15,6),
)

--- Following the PowerShell import:
ALTER TABLE etlImport ADD ValAboveAvg DECIMAL(11,6)

DECLARE @avg DECIMAL(7,4)
SELECT @avg = AVG(RepValue) FROM etlImport

UPDATE etlImport
SET ValAboveAvg = (RepValue - @avg)

--- Ready for next step:
SELECT *
FROM etlImport

Finally, we may want to allow users to pass in a configurable value against our average (or other aggregate) on our final report. We'll take the same logic from our first approach since this involves the final report and we'll add a configurable value where we can increase or decrease our average to our preference:

Note that 1.10 would be increasing by 10 percent while 0.90 would be decreasing by 10%.

DECLARE @avg DECIMAL(7,4), @configurable DECIMAL(5,4)
-- Example number that a user would pass in:
SET @configurable = 0.90
SELECT @avg = AVG(RepValue) FROM eltInitial
SELECT 
	RepDate
	, RepValue
	, (RepValue - (@avg*@configurable)) ValAboveAvgAdjusted
FROM eltInitial

While we used average in this tip, we could use other aggregate functions such as maximum, minimum, etc. For an example, we might want to conservatively estimate how much oil we will extract from a well and use a minimum with a configurable value that lowers it further. Relative to what we want our final report to look like we may want the derived value and remove the base values (especially if there are multiple columns, which will add to the costs). During our extraction and load, we can remove those columns in the SELECT statements once we determine they are no longer needed, which will help reduce resource use.

Next Steps

In some cases, a derived value may simply be a conversion. It may also be used in forecasting and adding the ability to configure it higher or lower is useful for reports.
One approach we used was adding the derived value on the final report which reduces the need to store this information - something that may not be required. As long as the base values exist in the final report and we need to show the derived values, we don't need to store it in these situations since we can always derive it.
Sometimes we need a derived value for further analysis, such as performing aggregates on derived values. In these cases, we may want to store them only as a part of a temporary ETL flow.
As soon as a base or derived value is no longer required during our data migration and loading, we should remove those values from the SELECT statements.

About the author

Tim Smith works as a DBA and developer and also teaches Automating ETL on Udemy.

This author pledges the content of this article is based on professional experience and not AI generated.

View all my tips

Considering Derived Values In ETL Flows

Problem

Solution

Next Steps

About the author

Comments For This Article