By: Rick Dobson | Comments | Related: > TSQL
Problem
Weather data can be useful for certain applications and creating a historical data warehouse can be helpful to align this data with certain events that occured. The data is readily availalbe for download, so how can you build a weather data warehouse in SQL Server?
Solution
Weather can impact many businesses. For example, snowy roads that freeze can result in increased insurance claims because of vehicle accidents. In some states, sustained cold weather can lead to above average levels of heating oil consumption. In other states, sustained hot weather can result in above average levels of demand for electricity to cool homes and workplaces. Understanding and preparing for weather impacts depends on learning how different aspects of a business vary when the weather changes. Businesses can prepare for and assess the impacts of weather changes on their operations by building a weather data warehouse.
The National Weather Service (NWS) Cooperative Observer Program (COOP) is sometimes referred to as the nation's weather and climate observing network of, by and for the people. Weather observations are collected at thousands of weather stations across the United States and its territories. This tip leverages several prior tips about gathering and processing weather data from COOP inside of SQL Server for the purpose of populating a data warehouse.
- You will gain exposure to basic kinds of weather data.
- You will learn how to process weather data for populating a data warehouse.
- You will witness the creation and population of a weather data warehouse based on COOP data.
Types and sources of weather data
Weather occurs at a geographic locale. Weather also changes over time. If you have a collection of locales for making weather observations, then you can report the weather from different locales by jurisdiction as well as over time, such as on a daily basis.
- The sites for making weather observations are often called weather stations. It is common to track weather stations by their latitude and longitude coordinates, elevation above sea level, and their jurisdiction, such as the county and state in which a station is located.
- The observations for this tip that are tracked from weather stations include high and low daily temperature, daily rainfall, daily snowfall and depth, and average wind speed.
The COOP weather stations are listed and documented in the National Oceanic and Atmospheric Administration (NOAA) website. The most recent COOP station metadata is maintained in a pair of txt files that can be downloaded from the NOAA Historical Observing Metadata Repository.
- One text file contains a row for each weather station. Within each row is data about a weather station, such as its latitude and longitude, the state and county in which it resides, its elevation, and identifiers like a station identification code and name.
- The second text file documents the first text file content. For example, the second file contains a description station metadata values.
- The metadata values in the first text file can change over time as stations are added to the COOP network or fresh identifier codes are assigned to existing stations.
- Within the context of this tip, a collection of weather stations was downloaded from the NOAA website as of March 2020, and a cleaned subset of its results are available from the ccsl_still_in_use table in the noaa_data database. This table is documented in this prior tip.
The following screen shot displays the first five rows from the ccsl_still_in_use table in the noaa_data database.
- The column names in capital letters are directly from the NOAA website.
- The column names in lower case letters are transformed representations of LATITUDE and LONGITUDE values in degrees minutes and seconds (dd.mmss). Each degree of latitude or longitude can have from 0 through 60 minutes, and each minute of latitude or longitude can have from 0 through 60 seconds.
- The COOP STATION ID values are similar to a primary key for weather stations,
except that a station id value can change between two difference reference sets.
- For example, a station id value can have one value from a source like the ccsl_still_in_use table.
- The same station can sometimes have a different station id value in another source, such as the weather records data source.
- How do you tell if two different station id values point to the same station? One answer to this question is to compare the latitude and longitude coordinates. If the coordinates are the same in the station metadata table and the weather records data source, then the two sets of station meta data are for the same station.
Descriptions of the different kinds of weather observations available from COOP are available in this source. A weather observation is what an observer reports for a weather attribute, such as the maximum temperature during a date. You can download a file with weather observations from one or more weather stations starting with the NOAA Climate Data Online Search form. A prior tip, Gathering Information for a Weather Data Warehouse in SQL Server, demonstrates the process for creating download files for up to eight stations from the top five most populous states according to the 2010 decennial US census. The prior tip illustrated in a step-by-step way how to submit data specifications for the download files from the NOAA website. The tip also shows one approach for how to copy the contents of the download files into SQL Server tables.
The following screen shot shows an excerpt from object explorer with five download files used by this tip.
- The top of the excerpt has a red border around the name of the database (noaa_data) containing SQL Server tables based on the downloaded weather files from the NOAA website.
- The five tables – one for each of the five most populous states – have a sequential number to the right of their names.
- All the tables reside in the dbo schema.
- The table names all have the same structure.
- The general name structure is xx_daily_weather_from_station_txt_flds where xx represents an abbreviation for the state jurisdiction from which a set of weather observations are made.
- The state abbreviations are, in order,
- ca for California
- fl for Florida
- il for Illinois
- ny for New York
- tx for Texas
The following script enumerates the column names with associated information for each of the five tables in the preceding screen shot.
- The query starts by setting noaa_data as the default database.
- Next, a select statement is performed against the information_schema.columns view.
- Finally, a where clause specifies the tables from which to return column names and associated information.
use noaa_data go select table_name,column_name, ordinal_position, data_type from information_schema.columns where table_name in ( 'ca_daily_weather_from_station_txt_flds' ,'fl_daily_weather_from_station_txt_flds' ,'il_daily_weather_from_station_txt_flds' ,'ny_daily_weather_from_station_txt_flds' ,'tx_daily_weather_from_station_txt_flds' )
The next screen shot shows an excerpt from the results set from the preceding script pasted into an Excel worksheet.
- There is a set of three columns, named column_name, ordinal_position, and data_type for each of the five tables listed in the where clause of the preceding script.
- The five table names in bold
font indicate to which table each set of three columns belongs.
- The first four column names for each table denote the same kind of information
for all five tables. This information is for noaa_station_id, elevation,
latitude_deg_dec, and longitude_deg_dec.
- noaa_station_id is an identifier number for the weather station.
- elevation is the feet above sea level for the weather station.
- latitude_deg_decimal and longitude_deg_dec are the latitude and longitude coordinates in decimal degree format. This format is different than the station metadata table values for latitude and longitude coordinates in degrees, minutes, and seconds. Therefore, in order to compare latitude and longitude coordinates between the two sources, the coordinates from the weather records data source and the station metadata table need to be transformed into one common format. The download file for this tip illustrates a simplified approach to this task. A fuller discussion and demonstration of how to transform coordinate values from degrees, minutes, and seconds to decimal degree format is available in this prior tip.
- The first four column names for each table denote the same kind of information
for all five tables. This information is for noaa_station_id, elevation,
latitude_deg_dec, and longitude_deg_dec.
- The fifth column is for the date of a weather observation. This tip warehouses weather observations from January 1, 2016 through December 31, 2019.
- The columns after date are for weather attributes.
- The number of columns after the date column is different for each of the five state tables.
- The number of columns can be different from one state to the next because the weather stations in some states can measure different weather attributes than in other states.
- All states have columns named prcp (for rainfall), snow (for snowfall), snwd (for snow depth), tmax (for highest daily temperature), and tmin (for lowest daily temperature).
- Some weather attributes appear in some but not all states. For example, awnd (for average wind speed) is in the tables for Florida, New York, and Texas. Not all weather attributes can be selected from every state.
- Because the state weather attribute tables are relational database tables, differences in the order of the columns between the state tables is immaterial.
Overview of the weather data warehouse model
A data warehouse can store data in dim and fact tables. The fact tables store facts about an entity’s attributes. In this tip, one entity is the weather. Weather attributes designate types of weather observations, such as minimum daily temperature (tmin). A weather observation can be made at a weather station on a date. Observations for a weather attribute can occur across multiple dates. Therefore, a fact weather table (FactWeather) can point at two dimensional tables – one for stations (DimStation) and another for dates (DimDate).
The FactStation table and the DimStation table both point at stations. Within the NOAA data sources, station id values are recorded separately from both the station metadata table (ccsl_still_in_use) and the weather records data source. Sometimes, the same weather station as indicated by its latitude and longitude coordinates, can have different station identifier values, such as station id values in the station metadata table and the weather records data source. As indicated in the preceding section, you can determine if a station with one or more different metadata values match by comparing their latitude and longitude values after transforming the coordinate values to a common format.
The following diagram portrays the relationships between the four tables in this tip’s weather data warehouse.
- The weather observations from the FactWeather table relate through foreign keys to the DimStation and DimDate dimensional tables. This is because weather observations are stored on a daily basis for weather stations.
- The metadata for the FactStation table relates to the DimStation dimensional table. The FactStation table provides a single coordinated source for weather stations despite differences that may exist for station id values between the station metadata table and the weather records data source. To force consistency in station identifiers for the same stations, the station_id values for the DimStation table are derived from the noaa_station_id values from the FactStation table.
- The arrows in the diagram represent foreign/primary key relationships. For example, the noaa_station_id column in the FactStation table (foreign key) has a referential integrity link to the station_id column (primary key) in the DimStation table.
Code for creating and populating data warehouse tables and indexes
Because of the way that the NOAA weather data are organized and reported, it is not straightforward to compile the DimStation and FactStation tables for the preceding data warehouse model. Here’s an itemization of station data processing challenges.
- There are two possible reasons for a disconnect between station identifiers
from the station metadata table and the weather records data source.
- A station identifier can exist in the station metadata table but have
no corresponding station identifier in the weather records data source.
- There is no major issue to work around in this case because the station identifier values without corresponding weather records are missing, and they should be missing.
- However, you should be aware that some station identifiers can legitimately be missing. If necessary, you can replace missing station identifiers with identifiers that are in both the station metadata table and the weather records source.
- A weather station can have a different station identifier in the station metadata table versus the weather records data source. Consequently, a return results set from a request to the NOAA website for weather records can return station ID values that are not in the set of weather stations for which weather is requested. The code in this tip and the download file illustrates one possible work-around to this kind of issue.
- A station identifier can exist in the station metadata table but have
no corresponding station identifier in the weather records data source.
- The FactWeather table is a more traditional fact table than the FactStation table. Additionally, the design of the FactWeather table is simplified even further by considering mostly just weather attributes that have at least some weather observations for all states covered in this tip. The awnd (average wind speed) weather attribute is included so you can see an example of how to process a weather attribute that is not present in each state.
- The DimDate table is more or less standard with date identifier columns that are typical of other data warehouses.
Compiling the FactStation and DimStation tables
The warehouse FactStation and DimStation tables start with an extract from the station metadata table (ccsl_still_in_use). Within the context of this tip, the ccsl_still_in_use table resides in the dbo schema of the noaa_data database. The extract is saved in a temporary table named #metadata_weather_station_IDs. Despite the name, the extracted table contains all columns of station metadata – not just the station identifier values (noaa_station_id).
- The code returns a results set with metadata for each of the five most populous states according to the 2010 US decennial census. These states are California (CA), Texas (TX), New York (NY), Florida (FL), and Illinois (IL).
- The select statement responsible for populating the #metadata_weather_station_IDs
table consists of an outer query that unites five results sets – one for
each of the five states. The results set for each state are integrated
via union operators combining the results set for each state into a single results
set.
- The query for each state consists of two queries – one for the first four and another for the last four weather stations within a state by COOP STATION ID values.
- A union operator concatenates the results sets from these two queries.
- An into clause in the outermost query saves the rows from its results set in the #metadata_weather_station_IDs table.
- A separate select statement at the end of the script displays the contents of the #metadata_weather_station_IDs table.
use noaa_data go -- list top and bottom 4 weather stations located in -- each of five states with the largest populations -- from COOP station metadata list -- 40 stations in extract begin try drop table #metadata_weather_station_IDs end try begin catch print '#metadata_weather_station_IDs temp table not available to drop' end catch select * into #metadata_weather_station_IDs from ( select top 4 * from [dbo].[ccsl_still_in_use] where STATEPROV = 'ca' ) ca_top union select * from ( select top 4 * from [dbo].[ccsl_still_in_use] where STATEPROV = 'ca' order by STATEPROV, [COOP STATION ID] desc ) ca_bottom union select * from ( select top 4 * from [dbo].[ccsl_still_in_use] where STATEPROV = 'tx' ) tx_top union select * from ( select top 4 * from [dbo].[ccsl_still_in_use] where STATEPROV = 'tx' order by STATEPROV, [COOP STATION ID] desc )tx_bottom union select * from ( select top 4* from [dbo].[ccsl_still_in_use] where STATEPROV = 'ny' ) ny_top union select * from ( select top 4 * from [dbo].[ccsl_still_in_use] where STATEPROV = 'ny' order by STATEPROV, [COOP STATION ID] desc ) ny_bottom union select * from ( select top 4 * from [dbo].[ccsl_still_in_use] where STATEPROV = 'fl' ) fl_top union select * from ( select top 4 * from [dbo].[ccsl_still_in_use] where STATEPROV = 'fl' order by STATEPROV, [COOP STATION ID] desc ) fl_bottom union select * from ( select top 4 * from [dbo].[ccsl_still_in_use] where STATEPROV = 'il' ) il_top union select * from ( select top 4 * from [dbo].[ccsl_still_in_use] where STATEPROV = 'il' order by STATEPROV, [COOP STATION ID] desc ) il_bottom select * from #metadata_weather_station_IDs
A prior tip, Gathering Information for a Weather Data Warehouse in SQL Server, illustrates a process for requesting weather records from NOAA for the eight stations from each state. The prior tip also demonstrates how to transfer the contents of the NOAA files based on the requests to SQL Server tables. The SQL Server tables are the five tables described at the end of the "Types and sources of weather data" section.
The next script creates a results set based on the same set of columns from the five state tables.
- Each row is for a combination of station id value (noaa_station_id), latitude in decimal degree format (latitude_deg_dec), longitude in decimal degree format (longitude_deg_dec), and state_abbr.
- Because of the way NOAA records weather station id values, along with their latitude and longitude values, there are actually 48 distinct rows in the output (although weather records were requested for just 40 weather stations). The most common reason for 48 distinct row values is for some pairs of rows with the same station id values having slightly different latitude or longitude values on different rows. In one other case, I discovered a record with a missing value indicator (-9999) for latitude and longitude.
-- distinct station id values from the noaa weather search tool -- for the five state tables with latitude and longitude -- coordinates in decimal degree format for each station -- 48 station IDs in extract begin try drop table #stations_in_weather_observations end try begin catch print '#stations_in_weather_observations temp table not available to drop' end catch select * into #stations_in_weather_observations from ( select distinct noaa_station_id, [latitude_deg_dec], [longitude_deg_dec], 'CA' state_abbr from ca_daily_weather_from_station_txt_flds union select distinct noaa_station_id, [latitude_deg_dec], [longitude_deg_dec], 'FL' state_abbr from fl_daily_weather_from_station_txt_flds union select distinct noaa_station_id, [latitude_deg_dec], [longitude_deg_dec], 'IL' state_abbr from il_daily_weather_from_station_txt_flds union select distinct noaa_station_id, [latitude_deg_dec], [longitude_deg_dec], 'NY' state_abbr from ny_daily_weather_from_station_txt_flds union select distinct noaa_station_id, [latitude_deg_dec], [longitude_deg_dec], 'TX' state_abbr from tx_daily_weather_from_station_txt_flds ) for_stations_in_weather_observations order by state_abbr -- all weather station id values with latitude and longitude -- coordinates in decimal degree format from weather records select * from #stations_in_weather_observations
Given these two record sources with station id values from the station metadata table and the weather records data source, it is possible to derive the weather stations for which weather records are available for analysis.
The following three queries shows a preliminary analysis. The comment lines contain important information to help you follow the analysis.
- The first query returns the intersection of all station id values from the weather records (noaa_station_id) with COOP STATION ID values from the station metadata table. There are 33 station id values in this set.
- The second query returns the COOP STATION ID values from the station metadata table that are not in the set of noaa_station_id values from the weather records. The except operator enables this kind of data processing. There are seven records in this results set.
- The third query returns station id values from the weather records data source that are not in the COOP STATION ID values from the station metadata table. This query also uses the except operator, but the order of the results being compared are in the reverse order of those in the second query. There are three records in this results set.
- The three records in the third results set need to be concatenated with the thirty-three records in the first results set to arrive at a full set of thirty-six weather stations for which weather attributes are available for analysis.
-- 33 station id values are in both -- #stations_in_weather_observations and #metadata_weather_stations select noaa_station_id, state_abbr from #stations_in_weather_observations intersect select [COOP STATION ID], STATEPROV from #metadata_weather_station_IDs -- 7 station id values are in #metadata_weather_station_IDs -- but not in #stations_in_weather_observations select [COOP STATION ID], STATEPROV from #metadata_weather_station_IDs except select noaa_station_id, state_abbr from #stations_in_weather_observations -- 3 station id values are in #stations_in_weather_observations -- but not in #metadata_weather_station_IDs select noaa_station_id, state_abbr from #stations_in_weather_observations except select [COOP STATION ID], STATEPROV from #metadata_weather_station_IDs
The preceding three queries point to several interesting facts about weather stations and weather attributes.
- There are thirty-three weather stations from the station metadata table that have weather records.
- There are an additional three weather stations with identifiers that appear in the weather records but not in the station metadata table. Because the source weather stations for collecting weather records was the forty weather stations from the five most populous states, this points to the possibility that the extra three weather stations were recoded between the station metadata table and the weather records data source.
- Finally, there must be four weather stations in the original forty stations from station metadata table for which there are no weather records.
The download at the end of this tip includes T-SQL code which confirms these interesting facts. That code creates and populates a new temporary table (#for_join_of_noaa_station_id_to_COOP_STATION_ID) with all the FactStation table field values for the three stations with recoded identifiers between the station metadata table and the weather records data source.
Here’s a script to concatenate the station data from the set of thirty-three stations and the set of three stations with recoded station_id values. Additionally, the following script pumps the concatenated records from the full set of thirty-six stations into the FactStation table. Finally, the script extracts station_id and station_name column values from the FactStation table to populate the DimStation table.
-- to populate station_FactStation and DimStation tables begin try drop table for_populating_station_id_fact_table end try begin catch print 'for_populating_station_id_fact_table not available to drop' end catch begin try drop table FactStation end try begin catch print 'FactStation not available to drop' end catch begin try drop table DimStation end try begin catch print 'DimStation not available to drop' end catch -- 33 station id values are in both -- #stations_in_weather_observations and #metadata_weather_stations select STATEPROV ,[STATION NAME] ,[COOP STATION ID] noaa_station_id ,[ccsl_still_in_use].[CLIMATE DIVISION] ,[ccsl_still_in_use].[COUNTY NAME] ,[ccsl_still_in_use].[NCDC STATION ID] ,[ccsl_still_in_use].[LATITUDE] ,[ccsl_still_in_use].[lat_dd.mmss] ,[ccsl_still_in_use].[LONGITUDE] ,[ccsl_still_in_use].[long_dd.mmss] ,[ccsl_still_in_use].[ELEVATION] ,[ccsl_still_in_use].[END DATE] into for_populating_station_id_fact_table from [dbo].[ccsl_still_in_use] where [COOP STATION ID] in ( select noaa_station_id from #stations_in_weather_observations intersect select [COOP STATION ID] from #metadata_weather_station_IDs ) union -- 3 recoded station ID values select [ccsl_still_in_use].STATEPROV ,[ccsl_still_in_use].[STATION NAME] ,#for_join_of_noaa_station_id_to_COOP_STATION_ID.noaa_station_id ,[ccsl_still_in_use].[CLIMATE DIVISION] ,[ccsl_still_in_use].[COUNTY NAME] ,[ccsl_still_in_use].[NCDC STATION ID] ,[ccsl_still_in_use].[LATITUDE] ,[ccsl_still_in_use].[lat_dd.mmss] ,[ccsl_still_in_use].[LONGITUDE] ,[ccsl_still_in_use].[long_dd.mmss] ,[ccsl_still_in_use].[ELEVATION] ,[ccsl_still_in_use].[END DATE] from [dbo].[ccsl_still_in_use] left join #for_join_of_noaa_station_id_to_COOP_STATION_ID on [ccsl_still_in_use].STATEPROV = #for_join_of_noaa_station_id_to_COOP_STATION_ID.state_abbr where -- Latitude and longitude coordinates for three weather stations -- from station metadata table with recoded station ID values ([lat_dd.mmss] = 26.4105 and [long_dd.mmss] = -080.0558) or ([lat_dd.mmss] = 32.2438 and [long_dd.mmss] = -099.4056) or ([lat_dd.mmss] = 42.4450 and [long_dd.mmss] = -073.4757) -- create and populate FactStation table select * into FactStation from for_populating_station_id_fact_table order by noaa_station_id -- display FactStation table select * from FactStation -- create populate DimStation table select noaa_station_id station_id ,[STATION NAME] station_name into DimStation from FactStation order by station_id -- display DimStation table select * from DimStation
Compiling the FactWeather table
The FactWeather table is a lot more straightforward to compile than the FactStation table. The FactWeather table concatenates via union operators weather records for each of the five states. There are eight columns in FactWeather table.
- Six of the columns (from prcp through awnd) are for weather attributes. When an attribute, such as awnd, is not available for a state, then it is assigned a value of null. Otherwise, the weather attribute is derived from the weather attribute value from the source table.
- The noaa_station_id and date columns are reference columns.
- The noaa_station_id column value denotes the weather station from which weather attributes were observed. The noaa_station_id column value can point to any of the thirty-six station_id values in the DimStation table.
- The date column value indicates when a weather attribute was observed. The date column values can range from January 1, 2016 through December 31, 2019. However, if there is no weather attribute observation from a weather station on a date, then row for that station_id/date combination is missing from table.
-- this code is for populating the FactWeather table begin try drop table FactWeather end try begin catch print 'FactWeather table not available to drop' end catch go select * into FactWeather from ( select noaa_station_id ,[date] ,tmax ,tmin ,prcp ,snow ,snwd ,null awnd from ca_daily_weather_from_station_txt_flds where noaa_station_id in (select distinct noaa_station_id from ca_daily_weather_from_station_txt_flds) union select noaa_station_id ,[date] ,tmax ,tmin ,prcp ,snow ,snwd ,awnd from fl_daily_weather_from_station_txt_flds where noaa_station_id in (select distinct noaa_station_id from fl_daily_weather_from_station_txt_flds) union select noaa_station_id ,[date] ,tmax ,tmin ,prcp ,snow ,snwd ,null awnd from il_daily_weather_from_station_txt_flds where noaa_station_id in (select distinct noaa_station_id from il_daily_weather_from_station_txt_flds) union select noaa_station_id ,[date] ,tmax ,tmin ,prcp ,snow ,snwd ,awnd from ny_daily_weather_from_station_txt_flds where noaa_station_id in (select distinct noaa_station_id from ny_daily_weather_from_station_txt_flds) union select noaa_station_id ,[date] ,tmax ,tmin ,prcp ,snow ,snwd ,awnd from tx_daily_weather_from_station_txt_flds where noaa_station_id in (select distinct noaa_station_id from tx_daily_weather_from_station_txt_flds) ) for_FactWeather order by noaa_station_id, date -- display the FactWeather table select * from FactWeather order by noaa_station_id, date
Compiling the DimDate table
The following script creates, populates, and displays values from the DimDate table for the weather data warehouse.
- The code starts by specifying @StartDate and @number_of_years local variables for the start date and number of years over which the date column has row values in the DimDate table.
- Next, a common table expression uses the local variables to allow for creating and populating of the DimDate table over a designated number of years from the following select statement. The list items in the following select statement designate expressions for the other columns besides date for the DimDate table.
- Finally, another select statement displays the column values for the rows in the DimDate table.
-- code to populate DimDate table -- based on dbo being the default schema for noaa_data begin try drop table DimDate end try begin catch print 'DimDate table not available to drop' end catch -- specify local variables to designate number of years over which -- DimDate extends -- number of years in date dimension is 4, and start date is January 1, 2016 declare @number_of_years int = 4, @StartDate date = 'January 1, 2016'; -- @cutoffdate is @number_of_years year(s) after start date declare @CutoffDate date = DATEADD(DAY, -1, DATEADD(YEAR, @number_of_years, @StartDate)); -- create and populate DimDate table ;WITH seq(n) AS ( SELECT 0 UNION ALL SELECT n + 1 FROM seq WHERE n < DATEDIFF(DAY, @StartDate, @CutoffDate) ), d(d) AS ( SELECT DATEADD(DAY, n, @StartDate) FROM seq ) select d date ,datename(weekday,d) dayname ,datepart(weekday,d) daynumber ,datename(month, d) monthname ,month(d) monthnumber , case when month(d) >=1 and month(d) <=3 then 'Q1' when month(d) >=4 and month(d) <=6 then 'Q2' when month(d) >=7 and month(d) <=9 then 'Q3' when month(d) >=10 and month(d) <=12 then 'Q4' else null end quarter -- can also use datename function instead, such as -- select 'Q'+(select datename(q, d)) quarter ,year(d) year into DimDate from d ORDER BY d OPTION (MAXRECURSION 0); -- display DimDate table select * from DimDate
Creating the primary and foreign keys for the data warehouse
This tip for a weather data warehouse includes two Dim tables and two Fact tables in its design. See the arrows in the data warehouse diagram in the "Overview of the weather data warehouse model" section for a picture denoting the relationships between tables. This section shows T-SQL code for setting the relationships between tables in the data warehouse model. In particular, this section presents code for creating primary keys in Dim tables and foreign keys in Fact tables that support the relationships between warehouse model tables.
- The DimStation table includes a variable length character field of up to fifteen characters, station_id, for a station identifier. The use of a character data type is derived from NOAA conventions for identifying stations. Station_id column values can serve as a primary key for the DimStation table.
- The FactStation table tracks weather stations with a broader array of attributes, such as state and county jurisdiction as well as latitude and longitude coordinates. By having a foreign key in the FactStation table that references the station_id value in DimStation table, you can ensure there are no orphan rows in the FactStation table.
- The weather attributes for the data warehouse in this tip track weather on a daily basis from January 1, 2016 through December 31, 2019. The DimDate table includes dates for the designated date range as well as other columns for classifying dates by day of week, month, and quarter. Because each row in DimDate must be unique by date, the date column serves as a convenient primary key for the DimDate table.
- Weather attributes, such as tmax and tmin for maximum and minimum daily temperature, are stored in the FactWeather table. Weather attributes are recorded at a weather station on a calendar date. Therefore, foreign keys can point back to the station_id column in the DimStation table as well as the date column in the DimDate table.
Here’s a T-SQL script for building primary and foreign keys described in the immediately preceding bullets.
-- adding primary and foreign keys to the data warehouse use noaa_data go -- set up primary key for DimStation and foreign key for FactStation -- set up station_id as a non-null column in DimStation alter table [dbo].[DimStation] alter column station_id nvarchar(15) not null -- set up non-null station_id column as a primary key alter table [dbo].[DimStation] add constraint pk_DimStation_station_id primary key clustered(station_id) -- set up noaa_station_id as a non-null column alter table [dbo].[FactStation] alter column noaa_station_id nvarchar(15) not null -- set up non-null noaa_station_id as foreign key alter table [dbo].[FactStation] add constraint fk_FactStation_noaa_station_id foreign key(noaa_station_id) REFERENCES dbo.DimStation(station_id) ON DELETE CASCADE ON UPDATE CASCADE ------------------------------------------------------------------------------ -- set up primary key for DimDate -- set up date as non-null column alter table DimDate alter column date date not null -- set up date as a primary key alter table [dbo].[DimDate] add constraint pk_DimDate_date primary key clustered(date) -- set up noaa_station_id and [date] columns as non-null columns alter table [dbo].[FactWeather] alter column noaa_station_id nvarchar(15) not null -- set up [date] as a non-null column alter table [dbo].[FactWeather] alter column [date] date not null -- set up noaa_station_id as a foreign key for FactWeather alter table [dbo].[FactWeather] add constraint fk_FactWeather_noaa_station_id foreign key(noaa_station_id) REFERENCES dbo.DimStation(station_id) ON DELETE CASCADE ON UPDATE CASCADE -- set up noaa_station_id as a foreign key for FactWeather alter table [dbo].[FactWeather] add constraint fk_FactWeather_date foreign key(date) REFERENCES dbo.DimDate(date) ON DELETE CASCADE ON UPDATE CASCADE
Next Steps
The data for this tip is built on weather records for each of five states and a station metadata table created in the prior tip Gathering Information for a Weather Data Warehouse in SQL Server. You can get the data for these tables from the download file available with that tip.
The download file for this tip includes all the code shown in this tip for creating and populating a weather data warehouse in SQL Server. Additionally, this tip’s download file includes a couple of other code segments that handle disconnects between station identifiers in the station metadata table and the weather records data source. These code segments are critical for making the code shown in the tip run correctly, but they are not of general interest to the topics of creating or populating a weather data warehouse in SQL Server. Finally, the download file for this tip includes several sample queries for unit testing the populated data warehouse model. If you choose to implement a weather data warehouse model like the one in this tip, you will likely find these unit testing queries of value.
About the author
This author pledges the content of this article is based on professional experience and not AI generated.
View all my tips