Techniques to handle timing issues and inconsistent data for SQL Server data loads
By: Tim Smith | Updated: 2017-02-14 | Comments | Related: More > Import and Export
We have several APIs where we extract data and the sources are not always consistent in timing and data. Sometimes an API that we extract from is offline for a period of time, but our users still want to see the most recent data we have. In other times, we receive data from an API that appears to be significantly higher or lower than what we would expect, making it difficult to know if it's accurate and it should be reported. These also tend to be periods where there are disruptions in the API as well, where we get no data again. What are some ways we can handle APIs that are inconsistent like this when users want reports regardless of whether data are most recent or not?
The answer depends on if the data provided to your clients need to be real-time and if your clients understand that the data they see, provided from an API (or multiple) may be inaccurate, delayed, or experience outlier values which could be valid, but affect what they need. We will assume in this tip that your clients are warned whether the data are real-time, possibly delayed, or are in aggregate to reduce likelihood of data issues, and focus on some technical solutions for showing the most recent data.
Consider the following scenarios:
- I want to know the latest data in the last [time period] and would rather see nothing if nothing returns.
- I want to know what the last data point was - if there's currently nothing, what was the last point?
- I want to know the tendency of the most recent data.
Let's look at three different ways you could report live API data using T-SQL to handle the following cases: using time restrictions, using recent timing, or using weights. While each of these carry advantages, depending on your need, they also carry disadvantages and we'll look at each.
Time restricting API data
This is a standard reporting approach where we filter by time in our where clause. A simple example of this where we return all the values from the last day:
SELECT OurColumn FROM OurTable WHERE OurDate > DATEADD(DD,-1,GETDATE())
If we choose this method, this allows us to disclose the timeframe up front - for an example, if we changed the filter to report in the last ten minutes, we would disclose that. This means that users of the data know that the data come from within that timeframe. However, if an API we extract from fails to report in that timeframe, the report may be blank, depending on the API frequency. How often do the data change and how often should they change? If the answer is all the time, in some cases, it is much better to report nothing if we've received nothing. In the case of the problem, this may not be appropriate for users though - if an API reports nothing and they still want to know what was the most recent data, the below cases may be more appropriate.
Reporting the latest data we have
For some reporting purposes, we want our reports to return the following logic - "The last time we checked [x], it was [y]." A simple example of this using the same naming convention as above this:
SELECT TOP 10 OurColumn ---- Explanation below FROM OurTable ORDER BY OurDate DESC
One benefit of the TOP approach with T-SQL is that we can return the latest number of values needed for a report, like returning the temperature at the last time it was checked - even if that wasn't within a time period (maybe the thermometer broke). In the question above this, if an API is down and no data are coming in, does the user want to see nothing, does the user want to see the last value (TOP 1), or does the user want to see the last top values (TOP n)? For reports that must always have most recent data obtained - because users only care about the last value, not whether the last value is the latest from the source - this meets that purpose. Where we might see this in practice would be a report showing how many users were online at a site, regardless of whether the site is currently online (which would be 0) - the last time it was online and checked might be helpful for some reports.
Reporting the data tendency in the last number of values that we have
This reporting technique uses a combination of the above approach while also considering the weight of the most recent values. This approach is strongest when we have data sets with multiple sources and when users want to know the most recent values, even if multiple data sources are reporting nothing at the moment. This queries a subset of the most recent data using TOP, and collects their average - which would be appropriate in some cases. Like the above two techniques, we could use a date filter, if we're willing to report no data in the event of data being missing, or we could use the TOP selection, which returns the latest when ordered by time. If multiple data sources exist, the OurTable in this example would be a collection of those multiple sources (we could also union a select of the most recent data from multiple sources):
SELECT AVG(OurColumn) ---- These may be desired too: --, MAX(OurColumn) --, MIN(OurColumn) FROM ( SELECT TOP 5 OurColumn FROM OurTable ORDER BY OurDate DESC )
Suppose that we have a set of interest rates that are 8%, 9%, 10%, 11%, 12%. We know the average is 10% and these values are close to each other, so the data tendency in this case is the average with the high of 12% and the low of 8%. When we use this approach, our users generally want a summary of the recent events in these data (used in the financial industry - stock highs, lows, opens, closes, etc., for the day). Remember that tendency is only what something "generally is" not what it "actually is at [x] point in time." This approach is incredibly useful when multiple API sources exist and these sources are used to track overall patterns, or used to reconcile data.
In this example, we use the average. However, if we keep a history of these data and determine that large swings can occur in the data, based on that behavior, we may prefer to use the median value of the TOP subset of data, or we could add more values to the TOP to centralize the data weight (though, this is predicated on the assumption that more data points equal a more centralized weight).
We can also combine these; for an example, the below code reports the latest value (TOP) we have and the last value within the last minute (time restricted); provided that we're receiving data without issue, these should always be the same in this example. However, if the time-restricted latest value returns nothing, we still know the last value that was received.
SELECT TOP 1 OurColumn LatestValue ,(SELECT OurColumn FROM OurTable WHERE OurDate BETWEEN DATEADD(MINUTE,-1,GETDATE()) AND GETDATE()) ValueLastMinute FROM OurTable ORDER BY OurDate DESC
- Disclose to clients what the timing parameters are of reporting data upfront and what will be reported if the source is unable to provide the latest data.
- Especially with live API extraction in ETL, based on the above parameters, consider where you handle the missing data. In some cases, missing data, or unreported data, may be what's needed, thus carrying that from the extract process will be required.
- With volatile and inconsistent data sets, consider weighing the data reported, provided that's clear to the end-user.
Last Updated: 2017-02-14
About the author
View all my tips