Preventing Invalid Data Early In An ETL Flow
By: Tim Smith | Updated: 2018-04-10
We receive data from thousands of sources, some of which provide direct access and some of which manually enter their data into their system or ours. We run a multi-server and multi-database ETL flow that obtains the original data, transforms the data into meaningful information, and finally presents the data to clients. Our clients have identified errors in our data (the presentation layer of our data), causing them to question other data we present to them. We’ve been able to identify that many of these errors come from sources that tend to use manual entry in their data flows, but we want to prevent these errors from ever making it into the presentation layer where our clients see the data.
We want to catch possible invalid data points as early as possible in our ETL flow, as clients will start to doubt our data if they see errors in the final data. If our data flow has 3 nodes, where the first node is the import node, the second node is the transformation node, and the third node is the presentation node, we may not identify some errors until the final node. However, it’s possible that we can catch these errors earlier in our ETL flow.
In this tip, we’ll look at some of the considerations for both data rules and logic to handle these rules.
Define strict data rules as early as possible
Since we have a multi-server ETL flow, we want to spend less processing time later in the flow by invalidating as many bad data points as possible early in the flow. We also want to eliminate the risk that clients receive low-quality data. In the below example set of data from a contrived import, one of the data points had a ValveInput value of NULL while also having a measurement date after the import date. Suppose that these violate our data rules – if we know this right after the import, we can invalidate these data immediately. The remaining data may be invalid due to data rules that we apply later in the flow, but they don’t violate the rules we have now.
First, we'll use a SELECT statement to highlight how this row violates our data rules. The WHERE clause specifies the rules for our data. We then set this row to be flagged in an update.
SELECT *
FROM tbValveImport
WHERE ValveId IS NULL
	OR ValveInput IS NULL
	OR ValveOutput IS NULL
	OR MeasurementTime >= ImportDate

UPDATE tbValveImport
SET ImportFlag = 1
WHERE ValveId IS NULL
	OR ValveInput IS NULL
	OR ValveOutput IS NULL
	OR MeasurementTime >= ImportDate

SELECT *
FROM tbValveImport
At this point, any data with an ImportFlag of 1 would not migrate to the next node for transformations. If we need these data for reference, such as contacting the source, we would log these data further. Remember that T-SQL is a set-based language. Validating data points row by row will not be cost effective with T-SQL, so if we can remove invalid data points by the set (such as where one date is earlier than another date), we want to do so as early in the ETL process as possible. We may later have to look at data points on their own, and we can use more optimized tools for those later challenges.
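As a sketch of that next step (the tbValveImportErrors log table and tbValveTransform destination table here are hypothetical names, not part of the original example), we could log the flagged rows for follow-up with the source and pass only clean rows to the transformation node:

```sql
---- Hypothetical log table; retains flagged rows so we can contact the source
INSERT INTO tbValveImportErrors (ValveId, ValveInput, ValveOutput, MeasurementTime, ImportDate)
SELECT ValveId, ValveInput, ValveOutput, MeasurementTime, ImportDate
FROM tbValveImport
WHERE ImportFlag = 1

---- Only unflagged rows continue to the (hypothetical) transformation table
INSERT INTO tbValveTransform (ValveId, ValveInput, ValveOutput, MeasurementTime)
SELECT ValveId, ValveInput, ValveOutput, MeasurementTime
FROM tbValveImport
WHERE ImportFlag = 0
	OR ImportFlag IS NULL
```

Because both statements operate on sets rather than individual rows, this keeps the early validation step consistent with T-SQL's set-based strengths.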
Identify the data sources that generate later problems and demarcate them
Separate problematic data sources to their own destinations, such as databases or sets of tables. If we find that certain data sources report data inaccurately due to manual entry, or a questionable ETL flow, we can keep these data in their own destinations to automate the next step when we identify problems, as well as reduce the performance impact of validating data. Problematic data sources often invite further data validation, which may impact other data sources that don't generate problems. For example, if we had 5 vendors who provided data and we saw many inaccuracies in one vendor's data, we would copy our import process to another database, or destination for the data, so we could add to our validation.
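One way to sketch this demarcation (the VendorId column, staging table, and vendor-specific destination table are hypothetical names for illustration) is to route the problematic vendor's rows to their own table at import time, so stricter validation can run there without touching the other vendors' data:

```sql
---- Hypothetical: route the problematic vendor's rows to a dedicated import table
INSERT INTO tbValveImportVendorFive (ValveId, ValveInput, ValveOutput, MeasurementTime, ImportDate)
SELECT ValveId, ValveInput, ValveOutput, MeasurementTime, ImportDate
FROM tbValveImportStaging
WHERE VendorId = 5

---- Remove the routed rows so the shared flow only validates the clean vendors
DELETE FROM tbValveImportStaging
WHERE VendorId = 5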
Part of this helps with organization and part of this flags questionable data where we may want to add extra logic for validation. As we’ll see in later steps, this may also improve performance, as the time spent validating these poor-quality data sources won’t interfere with validation for data sources where we rarely see errors. In general, the 80-20 rule applies to inaccurate data sources – 80% of erroneous data tends to come from 20% (or fewer) of sources, and in many cases, these errors are caused by manual entry. Using the same WHERE logic from the above example, suppose we had a vendor who allowed manual entry, which introduces possible erroneous measurements with negative numbers and invalid ValveIds. We would need to add to our WHERE logic:
SELECT *
FROM tbValveImport
WHERE ValveId IS NULL
	OR LEN(ValveId) < 11
	OR LEN(ValveId) > 11
	OR ValveInput IS NULL
	OR ValveInput < 1
	OR ValveOutput < 0
	OR ValveOutput IS NULL
	OR MeasurementTime >= ImportDate
As we can see, the more we add to the WHERE clause, the more it can possibly have an impact. In addition, we may need to add further validation, such as checking for duplicates, which other data sources may not require.
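For the sources that do require it, a duplicate check can also stay set-based. As a sketch (assuming, for illustration, that a duplicate means the same ValveId reporting more than one row for the same MeasurementTime), we could flag every row in a duplicated group in one statement:

```sql
---- Flag every row whose ValveId and MeasurementTime appear more than once
UPDATE t
SET t.ImportFlag = 1
FROM tbValveImport t
	INNER JOIN (
		SELECT ValveId, MeasurementTime
		FROM tbValveImport
		GROUP BY ValveId, MeasurementTime
		HAVING COUNT(*) > 1
	) dupe ON t.ValveId = dupe.ValveId
		AND t.MeasurementTime = dupe.MeasurementTime
```

Because this extra pass only runs against the demarcated destination for the problematic source, the other vendors' imports don't pay its cost.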
In rare cases, migrate data from a later step back to an earlier step for validation
In some cases, we may be able to migrate data from a later step - such as the post-transformation or presentation layer - back to the import step, because it lets us immediately validate a data point we received in a set. In these situations, we don't want to replicate the full presentation or transformation layer, only the specific data points that can be used to validate data in our import layer. In our example, suppose that our business logic states that once ValveInput hits the value of 100, the next ValveInput value for that ValveId must be less than 100 (and greater than 1). In this example, we don't need to migrate or replicate all the later data; we would only migrate the ValveId and the previous ValveInput value, as these are the only two values required for validation. In addition, these data points come from later in the ETL flow, since other validation is required there first.
SELECT t.ValveId
	, t.ValveInput
	, tt.ValveInput
FROM tbValveImport t
	---- This table holds data from later steps
	INNER JOIN tbValveInputValidation tt ON t.ValveId = tt.ValveId

UPDATE tbValveImport
SET tbValveImport.ImportFlag = 1
FROM tbValveImport
	INNER JOIN tbValveInputValidation ON tbValveImport.ValveId = tbValveInputValidation.ValveId
WHERE tbValveInputValidation.ValveInput = 100
	AND tbValveImport.ValveInput >= tbValveInputValidation.ValveInput

SELECT *
FROM tbValveImport
WHERE ImportFlag = 1
As we can see, in this case we would have flagged another value that could be prevented from continuing through our flow. In some cases, we may need to migrate more than just the most recent data point, so whether we decide to use a migration query in SSIS or a form of replication (only for the intent of reading for comparison), we have tools to move some data from later ETL steps back to an earlier step. In some situations, this isn't an option, and we'll have to wait to validate data later.
We've looked at some steps for reducing the likelihood that invalid data will continue through an ETL flow. As the flow continues, we would still apply strict business logic to other layers, such as the transformation layer, using the same techniques we've seen. With questionable data sources, I would still keep these data demarcated until the final step, as they will require more processing for validation throughout the flow.
- Aim to catch errors as early in the data flow process as possible, provided it’s cost efficient to do so (set of invalid data versus specific data points that are invalid).
- If we have a set of data points in our presentation layer that could invalidate new data, either replicate or migrate these final data points to prevent new data from continuing in our ETL flow when they are invalid.
- While the cost of further validation may be high, in some situations, this may be a necessary step. We may be able to obtain other data sources that allow us to compare and contrast data to ensure we’re providing clients with the most accurate data possible.