Time Series Data Fact and Dimension Tables for SQL Server


By:   |   Updated: 2020-02-14   |   Comments (3)   |   Related: More > T-SQL

Problem

Demonstrate some simple examples of how to build, populate, and query a data warehouse for time series data in SQL Server.  Please present sample code and results that facilitates adapting the solution to other time series datasets.

Solution

A prior tip illustrated how to collect time series data from Yahoo Finance to populate a table in SQL Server.  Stock price and volume data are a free, public source for time series data.  The table of values collected from Yahoo Finance contained millions of rows!

This tip demonstrates how to build on the prior tip by populating a data warehouse with the collected data.  More specifically, you will see how to create dimension and fact tables for a data warehouse.  You will also discover simple examples of how to query the data warehouse to answer common questions for time series data.

An overview of the data warehouse

The screen shot below shows a database diagram of the data warehouse model.  The model’s design can be easily adapted to time series data from many different domains.

  • The fact table in this tip has the name yahoo_prices_valid_vols_only.  This table contains three kinds of data common for time series datasets.
    • The Date column denotes a datetime dimension – when were the facts assessed.
    • The Symbol column denotes an entity type for the tracking of time series data.  In this case, the entity is a stock symbol.  However, it could as easily denote a product sold by a company or a weather station from which temperature and precipitation are measured.
    • The Open, High, Low, Close, and Volume columns denote measures on entities that can change over time.
  • A dimension table in a data warehouse model characterizes a column in the fact table as belonging to a dimension value, such as a date or a symbol.  The diagram below shows two tables used for storing dimension values.
    • The datedimension table denotes sequential trading dates and related measures to help track and analyze datetime values.
    • The symboldimension table contains values identifying different entities.  In this tip, the entities are symbols, but they can be anything else about which you are collecting measurements over time.  The symboldimension table, like the datedimension table, includes one or more additional columns; the additional column in the symboldimension table is named Exchange.  The Exchange column values denote the name of the exchange on which a symbol trades (AMEX, NASDAQ, and NYSE).
data_warehouse_fig_1

Building and populating the dimension tables

When designing a data warehouse in SQL Server, you will typically build and populate the dimension tables prior to the fact table.  This is because the fact table specification references the dimension tables.

All dimension tables for time series data must have a dimension pointing at datetime units.  The dimension table for datetime in this tip has the name datedimension.  The code at the beginning of the following script conditionally removes any prior version of the datedimension table.  The specification for the datedimension table appears below.

  • Its primary key is date, which has a date data type.
  • The dayname and daynumber columns contain two different ways of characterizing the day of the week for a date value.
  • The monthname and monthnumber columns contain two different ways of characterizing the month to which a date belongs.
  • The column named year contains an integer value for the year in a date.
  • Note from the comments at the top of the script that no foreign key reference should point at the dimension table prior to its creation.  This can become an issue if you try to create a dimension table after creating a fact table.
-- drop table for dbo.datedimension if it exists
-- assumes no foreign key reference already exists to the table
-- view the code in the "Building and populating the fact table" section
-- for sample code to remove a foreign reference to this table
if (select(object_id('dbo.datedimension'))) is not null
drop table dbo.datedimension
go 
 
-- create table for datedimension
create table dbo.datedimension(
   [Date] date not null primary key,
   [dayname] nvarchar(10) not null,
   [daynumber] tinyint not null,
   [monthname] nvarchar(10) not null,
   [monthnumber] tinyint not null,
   [year] int not null
) 

The following script populates and displays the datedimension table.  An insert statement populates the datedimension table from a select statement, which has an inner select statement named for_distinct_dates.  The query for_distinct_dates extracts distinct dates from a table named yahoo_prices_volumes_for_MSSQLTips.  The process for populating the table in the sub-query appears in the prior tip for collecting time series data for over 8,000 symbols from Yahoo Finance.  Dayname, daynumber, monthname, monthnumber, and year are computed fields based on date column values.

-- simple date dimension table
insert datedimension (date, dayname, daynumber, monthname, monthnumber, year)
select 
 date
,datename(weekday,date) dayname
,datepart(weekday,date) daynumber
,datename(month, date) monthname
,month(date) monthnumber
,year(date) year
from (select distinct date date from dbo.yahoo_prices_volumes_for_MSSQLTips) for_distinct_dates
order by date
 
-- echo datedimension
select * from datedimension

The following screen shot shows the first and last six rows from the datedimension table.

  • The date values start with the first trading date in 2009, January 2, 2009.
  • The date values end with the last trading date for which prices and volumes were collected.  This date is October 7, 2019.
  • The daynumber and monthnumber column values facilitate sorting query results by day-of-week and month order instead of alphabetical order for dayname or monthname.
    • Daynumber refers to a weekday number starting with 2 for Monday.  The last trading date for a week corresponds to a daynumber value of 6 for Friday.
    • Monthnumber refers to the number for a month starting with 1 for January and running through 12 for December.
data_warehouse_fig_2
data_warehouse_fig_3

You will typically need a categorical dimension to track entities to which time series measurements belong.  This tip uses the name symboldimension for the table holding entity identifiers and other values that characterize entities.  The entities in this tip are symbols trading on one of three stock exchanges.  Each stock symbol is characterized by an exchange name where the stock symbol is listed for trading.

Here’s the T-SQL specifying the symboldimension table.  Some conditionally executed code at the beginning of the script removes any prior version of the symboldimension table in the dbo schema.  This dimension table has two columns – one for the symbol and another for the exchange on which a symbol trades.  Both columns have an nvarchar data type.

-- drop table for dbo.symboldimension if it exists
-- assumes no foreign key reference already exists to the table
-- view the code in the "Building and populating the fact table" section
-- for sample code to remove a foreign reference to this table
if (select(object_id('dbo.symboldimension'))) is not null
drop table dbo.symboldimension
go 
 
-- create table for symboldimension
create table dbo.symboldimension(
   Symbol nvarchar(10) not null  primary key,
   Exchange nvarchar(10) not null
) 

The following T-SQL script populates the symboldimension table based on two sources.

  • The yahoo_prices_volumes_for_MSSQLTips table is a table with time series data for the symbols.  The distinct keyword in the select statement within the from clause returns just one row for each symbol across all the time series rows for each symbol.  The time series data is derived originally from Yahoo Finance.
  • The ExchangeSymbols table is from the eoddata company, which provides Exchange names for where the symbols are listed for trading.  This information is available without charge; the process for collecting the data is described here.
-- simple symboldimension table
insert symboldimension 
select one_row_one_symbol.symbol, ExchangeSymbols.Exchange
from
(select distinct symbol symbol
from dbo.yahoo_prices_volumes_for_MSSQLTips) one_row_one_symbol
inner join dbo.ExchangeSymbols
on one_row_one_symbol.symbol = ExchangeSymbols.symbol

Here are two excerpts from the symboldimension table.  The first excerpt shows the first six rows from the table, and the second excerpt from the table shows the last six rows.

  • The first excerpt shows the symbols starting in alphabetical order (Symbol is the primary key for the table).  The second column is an identifier for the exchange on which a symbol trades.
  • The second excerpt shows there are 8089 symbols in the time series dataset.
data_warehouse_fig_4
data_warehouse_fig_5

Building and populating the fact table

Here’s the script for specifying the fact table.  The fact table stores the time series data for the entities in the data warehouse.

  • The name for the fact table is yahoo_prices_valid_vols_only.  This name reflects fact that the source data are cleaned in some way from the original source data from Yahoo finance.  In particular, the fact table only includes rows for trading dates on which there is a valid volume value – namely one that is neither null nor zero.  A date is not a trading date if no shares are exchanged for the symbol on that date.
  • There are seven columns in the fact table.
    • Date references the Date column in the datedimension table.
    • Symbol references the Symbol column in the symboldimension table.
    • Open through Volume represent measures on a symbol for a trading date.
      • Open through Close represent different prices for a security during a trading date.
        • Open is the opening price on a trade date.
        • High is the largest price over the course of a trade date.
        • Low is the smallest price over the course of a trade date.
        • Close is the closing price on a trade date.
      • Volume denotes the number of shares exchanged during a trading date.
  • The script closes with the addition of two foreign key constraints that tie the fact table to the dimension tables as well as a primary key based on Symbol and Date.  Every row in a time series dataset is unique based on the entity and date for which the measures were collected.
    • The fk_date constraint links Date column values from the fact table to Date column values in the datedimension table.
    • The fk_symbol constraint links Symbol column values from the fact table to Symbol column values in the symboldimension table.
    • The primary key is not strictly necessary.  This constraint can facilitate visual verification of the results sets derived for queries based on the fact table along with dimension tables.
    • The go keywords after alter table statements for constrains are not strictly necessary, but their use facilitates debugging statements to conditionally drop and add constraints through the display of error statements for code blocks.
-- drop and re-create dbo.yahoo_prices_valid_vols_only table 
-- dbo.yahoo_prices_valid_vols_only is the fact table for the data warehouse
if (select(object_id('dbo.yahoo_prices_valid_vols_only'))) is not null
drop table dbo.yahoo_prices_valid_vols_only
go
 
create table dbo.yahoo_prices_valid_vols_only(
   [Date] [date] NOT NULL,
   [Symbol] [nvarchar](10) NOT NULL,
   [Open] [money] NULL,
   [High] [money] NULL,
   [Low] [money] NULL,
   [Close] [money] NULL,
   [Volume] [bigint] NULL
)
 
go
 
--  conditionally drop primary key named pk_symbol_date
if (select(object_id('pk_symbol_date'))) is not null
alter table dbo.yahoo_prices_valid_vols_only drop constraint pk_symbol_date;
go
 
alter table dbo.yahoo_prices_valid_vols_only
   add constraint pk_symbol_date primary key clustered (Symbol, Date);
go
 
 
-- conditionally drop foreign key constraint named fk_date
if (select(object_id('fk_date'))) is not null
alter table dbo.yahoo_prices_valid_vols_only drop constraint fk_date;
go
 
alter table dbo.yahoo_prices_valid_vols_only
   add constraint fk_date foreign key (Date)
      references dbo.datedimension (date)
      on delete cascade
      on update cascade;
go
 
 
-- conditionally drop foreign key constraint named fk_symbol
if (select(object_id('fk_symbol'))) is not null
alter table dbo.yahoo_prices_valid_vols_only drop constraint fk_symbol;
go
 
alter table dbo.yahoo_prices_valid_vols_only
   add constraint fk_symbol foreign key (Symbol)
      references dbo.symboldimension (Symbol)
      on delete cascade
      on update cascade;

Here’s a T-SQL script for populating the fact table from the yahoo_prices_volumes_for_MSSQLTips table, which contains the original source data from Yahoo Finance.

  • The where clause performs cleaning of the source data to remove rows for trading dates during which no shares were exchanged.  The comment lines at the top of the following script denote the original number of rows in the source data and the number of rows in the cleaned fact table.
  • The order by clause is not strictly necessary, but it ensures the time series data pumped into the fact table is arranged by date within symbol.  If you care to visually inspect the source data before insertion into the fact table, the order by clause adds value.
-- populate dbo.yahoo_prices_valid_vols_only
-- yahoo_prices_volumes_for_MSSQLTips 
-- with volume is not null and volume != 0
-- source table has 15483188 rows
-- where clause filters out 862303 rows
-- cleaned results set has 14620885 rows
insert into dbo.yahoo_prices_valid_vols_only
select 
 [Date]
,[Symbol]
,[Open]
,High
,Low
,[Close]
,Volume
from dbo.yahoo_prices_volumes_for_MSSQLTips
where ([volume] is not null) and ([volume] != 0)
order by Symbol, Date

Here are excerpts from the full display of the fact table via a T-SQL script (select * from dbo.yahoo_prices_valid_vols_only).

  • The first screen shot is for the first six rows from the fact table.  This shot features the A symbol beginning with the first trade date in the table.
  • The second screen shot is for the last six rows from the fact table.  This shot features the ZYXI symbol ending with the last trade date in the table.
  • The primary key for the fact table determines the order of rows by trade date within symbols.
data_warehouse_fig_6
data_warehouse_fig_7

Introduction to querying the fact and dimension tables

This section introduces you to some basics for querying fact and dimension tables with the data from this tip.  The following query computes three aggregations of the data for the fact table joined to the dimension tables.

  • The aggregations are for
    • the count of distinct symbols from the symboldimension table
    • the count of distinct trading dates from the datedimension table
    • the count of trading dates for all symbols across all dates in the fact table
  • The data source for the query is the fact table inner joined to
    • the datedimension table by Date
    • the symboldimension table by Symbol
  • The query groups the joined data sources by Exchange column values from the symboldimension table
-- aggregation for count of distinct symbols,
-- count of distinct date dimension values (distinct_trading_dates) and
-- count of rows (number_of_trading_dates)
-- grouped by and ordered by Exchange in symboldimension
select 
 symboldimension.Exchange
,count(distinct symboldimension.Symbol) distinct_symbols
,count(distinct datedimension.Date) distinct_trading_dates
,count(*) number_of_trading_dates
from dbo.yahoo_prices_valid_vols_only
inner join datedimension
on yahoo_prices_valid_vols_only.Date = datedimension.Date
inner join symboldimension
on yahoo_prices_valid_vols_only.Symbol = symboldimension.symbol
group by symboldimension.Exchange

Some readers may be able to enhance their understanding of the query code by examining its results set. 

  • There are three rows in the results set – one for each of the three stock exchanges contributing time series data to the data warehouse.
  • Aside from the column values denoting exchanges, there are three columns for aggregate values.
    • The distinct_symbols column presents a count of the distinct symbols from the symboldimension table.  The sum of the values in the distinct_symbols column across all three exchanges (3309+2198+2582) is 8089.  The listing of symbol values in the symboldimension table confirms this value.
    • The count of distinct trading dates is the same for all three exchanges.  This value is 2709.  The listing of date values in the datedimension table confirms this value.
    • The number of trading dates varies by exchange – especially for the AMEX exchange relative to the NASDAQ and NYSE exchanges.  The sum of the number of trading dates across all three exchanges (5746100+3347574+5527211) is 14620885.  The listing of rows in the fact table confirms this value.
data_warehouse_fig_8

Is there a day-of-week effect?

One reason for performing queries on the contents of a data warehouse is to test different hypotheses about the warehouse contents.  For example, some manager or analyst may ask: are close prices consistently higher on one day of the week and lower on another day of the week?  This section reviews a couple of query designs with their results sets to illustrate different approaches for developing results bearing on a question like this.

The first query design computes the average close price and count of trading dates for each day of the week.  The code for the query design appears below.  The query design is a nested one.

  • The inner query named for_daily_average_close extracts base data from the fact table and the datedimension table.
    • The fact table provides base data for close prices along with their trading dates.
    • The datedimension table provides base data on the day of the week to which a trading date belongs.  Day of the week is denoted by both daynumber and dayname.
  • The outer query includes a group by clause for daynumber and dayname for the results set from the inner query.  The outer query also orders its result set rows by dayname within daynumber.
    • The select list for the outer query includes aggregate functions for computing average close price and number of trade dates by day of the week.
    • The select list also includes daynumber and dayname fields to work in coordination with the group by clause.  These fields support the display of average close price and number of trade dates by day of the week.
    • The final order by clause arranges the rows of output from the outer query in day-of-the-week order instead of alphabetical order by dayname.
-- close by day of week across all symbols
select daynumber, dayname, avg([close]) [Close], count(*) trade_dates
from
(
select 
 yahoo_prices_valid_vols_only.Date
,datedimension.daynumber
,datedimension.dayname
,dbo.yahoo_prices_valid_vols_only.[close]
from dbo.yahoo_prices_valid_vols_only
inner join datedimension
on yahoo_prices_valid_vols_only.Date = datedimension.Date
) for_daily_average_close
group by daynumber, dayname
order by daynumber, dayname

Here’s the results set from the first query design.

  • You can see that the daynumber column values appear in numerical order from 2 through 6 so the dayname column values appear from Monday through Friday.
  • There are two easily identifiable points about the close prices and number of trading dates.
    • The largest average close price is for Wednesday (appears with a yellow background), and the smallest average close price is for Monday (appears with a red background).
    • The weekdays of Tuesday through Friday each have about 2.9 million trading dates, but Monday only has about 2.75 million trading dates.
data_warehouse_fig_9

The preceding screen shot shows average close price and number of trading dates across all symbols.  Recall that the data in the data warehouse are for over 8,000 symbols.  Do the results across all symbols match those for individual stock symbols?

The following script shows an adaptation of the preceding script.  Instead of reporting average close price and number of trading dates by day of the week across all symbols, the following script reports these results for a single symbol  -- namely, SPY, which is a symbol based on the aggregation of price and volume for all individual symbols in the S&P 500 index.

  • As with the preceding query, this new query design relies on a nested query.
  • The inner query has two changes from the preceding query design for results across all symbols.
    • The inner query includes an inner join of the fact table with the symboldimension table.
    • In addition, the inner query includes a where clause to return a results set of average close price and number of trading dates only for the SPY symbol.
  • The outer query also has two changes from the preceding query design for results across all symbols.
    • The group by clause is for Symbol as well as daynumber and dayname.
    • Also, Symbol appears in the first field in the select list.
-- close by day of week for SPY
select Symbol, daynumber, dayname, avg([close]) [Close], count(*) trade_dates
from
(
select 
 yahoo_prices_valid_vols_only.Symbol
,yahoo_prices_valid_vols_only.Date
,datedimension.daynumber
,datedimension.dayname
,dbo.yahoo_prices_valid_vols_only.[close]
from dbo.yahoo_prices_valid_vols_only
inner join datedimension
on yahoo_prices_valid_vols_only.Date = datedimension.Date
inner join symboldimension
on yahoo_prices_valid_vols_only.Symbol = symboldimension.symbol
where symboldimension.symbol = 'SPY' 
) for_daily_average_close
group by Symbol, daynumber, dayname
order by daynumber, dayname

By replacing SPY in the preceding script with other symbols, you can easily generate results for a range of individual symbol values.  In this way, you can generate results that allow a verification of whether the average results for weekday effects across all symbols is consistent with comparable results for individual symbols.  Of course, you can also assess if the weekday effects are the same across a select set of individual symbols.  The following screen shot shows the results for this select set of individual symbols: SPY, QQQ, MSFT, SOXL, ADBE, and ENPH.

  • The largest average weekday price for each symbol has its dayname and average close price highlighted in yellow.
  • The smallest average weekday price for each symbol has its dayname and average close price highlighted in red.

The largest average close price is on Friday for all the selected symbols, except ENPH.  This outcome is at variance with the results across all symbols where the average highest close price is on Wednesday.  The smallest average close price varies among the members of the select set of individual symbols.  Also, the smallest average weekday close across all symbols does not match the smallest weekday close for any of the select set of symbols.  The easy take-away from these query results is that there is no evidence of a consistent day-of-the-week effect

  • for largest or smallest close prices across different symbols
  • or for a select set of symbols versus all the symbols.
data_warehouse_fig_10
Next Steps

Try out the code examples for this tip.  All the referenced scripts are in this tip’s download file.  You will additionally need to run the scripts from this other tip to create the dbo.yahoo_prices_volumes_for_MSSQLTips table and ExchangeSymbols table to which this tip refers.

  • Try to duplicate the results in this tip with the same symbols as in this tip.
  • If you are not interested in three major US exchanges, try using symbols from other exchanges available from eoddata.
  • You can also try to use the approach described here for ticker symbol data to other kinds of data, such as readings from weather stations or product sales and production data over time.
  • Finally, leave a comment about your success in adapting this tip for your business as well as any related issues you want covered in future tips.


Last Updated: 2020-02-14


get scripts

next tip button



About the author
MSSQLTips author Rick Dobson Rick Dobson is a Microsoft Certified Technical Specialist and well accomplished SQL Server and Access author.

View all my tips





Comments For This Article




Friday, February 14, 2020 - 10:50:20 AM - Luis Eduardo Soares Back To Top (84486)

Very interesting tip. Thanks for sharing!


Friday, February 14, 2020 - 10:04:10 AM - Rick Dobson Back To Top (84484)

Hi Paul,

Thanks for your question.

I am not sure what would be a good answer to your question for all data and in all cases.

When I work with time series data, and when I view time series analysis from others, it is common to omit dates for which there is no data.

Clearly, different domains, such as stock market ohlcv data and factory production data, can have different gaps in different places.  It may be better to have a calendar with all dates whether or not there is any data for all dates.  However, this means that you will, at least some times, just have to perform some downstream processing to omit the dates with no data.  In my experience, analysts who consume time series data do not care about dates that are missing because markets are closed or factories are closed for change over periods.

I designed the calendar with stock data in mind.  It may be that for some other domains you may care to include some or all dates that I omitted.  Your audience of data consumers will likely have some thoughts that they can share you about this kind of issue.  I hope this reply helps.

Rick Dobson


Friday, February 14, 2020 - 5:18:02 AM - Paul Back To Top (84471)

Thank you for this tip! What comes to my mind is the approach to populating your date dimension table.

In this tip, you populate the date dimension table based on your fact table. If there are gaps in the fact table (which there are in this case - weekend days where no trading takes place), there will be gaps in the date dimension table as well.

My question would be if you could elaborate on the difference / benefits / drawbacks when using:

1) date dimension based on a fact table (such as in this case) or

2) script-generated date dimension table out of the internet which contains all dates of all years within a given period.

Regards,

Paul



download


Recommended Reading

Rolling up multiple rows into a single row and column for SQL Server data

SQL Server Loop through Table Rows without Cursor

Using MERGE in SQL Server to insert, update and delete at the same time

Cursor in SQL Server

How to use @@ROWCOUNT in SQL Server





get free sql tips
agree to terms


Learn more about SQL Server tools