Weather Data Warehouse in SQL Server - Gather Information
Building a data warehouse for data analysis can be extremely helpful to look back at the past and forecast the future with the collected data. One set of data that we can store in a data warehouse is weather related data. In this tip we look at how to collect data from National Oceanic and Atmospheric Administration (NOAA) and how we can load and query this data in SQL Server.
In this article, we will demonstrate some issues and workarounds that I encountered while gathering information for a weather data warehouse in SQL Server. The source of the data is from the National Weather Service (NWS) Cooperative Observer Program and I will highlight the impact of state jurisdictions as well as latitude and longitude coordinates verifying with the collected data.
The Cooperative Observer Program (COOP) is comprised of volunteer weather observers who use standardized practices to observe and record daily surface weather, such as temperature, rain fall, snow fall, and wind. The COOP is administered by the National Weather Service, which is an organization within the National Oceanic and Atmospheric Administration (NOAA) of the Department of Commerce. COOP weather observers are from the United States as well as its territories and provinces.
There are thousands of weather stations in the COOP program, and weather observations from these stations can be complemented by other weather collection programs under the auspices of the NOAA. Two prior tips, Analyze Historical Weather Data with SQL Server and Tracking Weather Stations Geographically with SQL Server and Excel, illustrate the basics for how to gather data about COOP weather observations from COOP weather stations for populating SQL Server tables.
Weather data can serve many business applications, such as evaluating the impact of weather on insurance claims, heating oil consumption, electricity load during summer months in hot weather areas, agricultural output, and clothing items purchased from retail outlets. COOP weather observations are a prime source of data for these applications because, among other considerations, the data are available without charge.
This tip goes beyond prior tips for collecting weather data from COOP weather stations. The tip's focus is to describe and illustrate some of the issues you can encounter when gathering data for a weather data warehouse in SQL Server. A subsequent tip will demonstrate a process for populating a weather data warehouse based on the weather data imported into SQL Server from this tip.
An introduction to COOP weather data
There are two key components to COOP weather data observations.
The base component is a list of COOP weather stations. This data source maintains data about stations, such as station name, station latitude and longitude coordinates, state and county jurisdiction names for stations, as well as the end date for daily observations from a station. The NOAA describes this list as a source for station metadata. Each weather station also has a unique station id value. However, the rows for whether stations on the list can change over time.
The second major component of COOP weather data observations consists of time series values for different aspects of the weather, such as precipitation, surface air temperature, and wind. For example, a weather station can facilitate observations for weather elements, such as the maximum temperature during a day. The weather elements from a weather station can vary from one station to another as well as for the same station over time. For example, a weather station can go temporarily out of operation or an observer may not be available over a range of dates. Furthermore, the station id values for some stations can be different for the same station between the station metadata component and the weather observation component.
The list of COOP weather station metadata within this tip is available from a table named ccsl_still_in_use in the dbo schema of a database named noaa_data. You can learn more about the design of this table from article Analyze Historical Weather Data with SQL Server.
Here is a T-SQL script for returning all the rows sorted by STATEPROV values for all the columns in the ccsl_still_in_use table. Over time, station id values from any version of the ccsl_still_in_use table can go out of synchrony with the station id values from the weather observations data source. This is because weather stations can come into and go out of service during a range of dates. Beyond these two considerations, the NOAA sometimes revises the station id values for selected stations in the weather data source without a corresponding change in the station metadata source. For these kinds of reasons, you are unlikely to find a perfect match between the station id values from the ccsl_still_in_use table, which reflects metadata station list values and the station id values from weather stations.
-- return all the rows for all the columns -- in the ccsl_still_in_use select * from noaa_data.[dbo].[ccsl_still_in_use] order by STATEPROV
The following two screen shots show the first ten rows followed by the final ten rows from the preceding script's results set.
- Notice from the row indicator in the second screen shot that there are 8824 rows for weather stations that were still in use as of the time the download from the NOAA web site was implemented. Recall that the number of weather stations can change from time to time.
- There are twelve columns in the results set.
- The column names in capital letters are from the NOAA web site.
- The two column names in lower case letters (lat_dd.mmss and long_dd.mmss) are reformatted values from NOAA source columns named LATITUDE and LONGITUDE, respectively. The transformed column values clarify that the original data specifies the latitude and longitude for weather stations in degrees (dd), minutes (mm), and seconds (ss), which is one of several widely used formats for representing latitude and longitude coordinate values.
- Each row for a weather station within the weather station metadata list is identified by a COOP STATION ID value. Because weather stations are under the auspices of the NOAA, they also have a NCDC (National Climatic Data Center) STATION ID value. The NCDC is a sister agency to the NWS within the NOAA.
- STATION NAME, STATEPROV, and COUNTY NAME further contribute to the identification of weather stations. STATEPROV values are two-character USPS field values for identifying states, territories, or provinces in which a weather station is located.
- ELEVATION denotes the height of the weather station in feet relative to sea level.
- CLIMATE DIVISION code was originally developed for monitoring of drought, temperature, precipitation, and heating/cooling degree days for regions within the continental United States. A good resource to start learning about climate division code values is the U.S. Climate Divisions page at the NOAA web site.
Each COOP station can generate multiple time series. When dealing with daily observations, the period for each time series will be one day. There will typically be more than one time series for a station. For example, regarding daily temperature, there are often at least two time series (daily minimum temperature and daily maximum temperature). Other time series values associated with weather stations include rain fall, snow fall, and accumulated snow depth.
The second major resource for COOP weather observations are daily recordings of weather values at weather stations. You can choose to download weather observations as csv files, pdf files, or fixed-width text files. To start downloading weather data navigate to the initial weather search form at the NOAA web site. This tip provides a framework and sample weather observations based on the full set of NOAA search forms. The returned values are in fixed-width text files for as many as eight weather stations for each of the five most populous states based on the 2010 decennial census.
Collecting weather station metadata and COOP weather time series
This tip demonstrates three steps for collecting COOP weather time series values for storage in SQL Server. A subsequent tip will illustrate how to create, populate, and query a weather data warehouse in SQL Server based on the stored values from this tip. The steps applied in this tip include these tasks – two elaborated in this section, and a third step demonstrated in the next section.
- Specify the weather stations from which to collect weather data.
- Use the NOAA web-based search forms to designate the stations and specific weather series to download. Also, specify an email address to which to send the links for downloading the fixed-width text file with the weather data from the NOAA web site.
- Download the fixed-width text file and import it into a SQL Server table.
In this tip, you will see the steps for downloading weather data for a sample of weather stations from the five most populous states. Wikipedia denotes the five most populous states from the 2010 decennial census as: California, Texas, New York, Florida, and Illinois.
The following script returns the count of COOP weather stations for these five states.
-- stations by state for the five states -- with the largest populations by 2010 decennial Census select STATEPROV, COUNT(*) stations_in_states from noaa_data.[dbo].[ccsl_still_in_use] where STATEPROV IN('CA','TX','NY','FL','IL') group by STATEPROV order by COUNT(*) desc
This screen shot shows the results set from the preceding script.
The next script shows the T-SQL code for selecting the first four and last four weather stations from each of the five most populous stations. The station list at the NOAA web site are in ascending order by COOP STATION ID value.
- The first select statement with a top operator and no order by clause enumerates the first four weather station COOP STATION ID values for a state.
- The second select statement with a top operator and an order by clause with a desc argument enumerates the last four weather station COOP STATION ID values for a state.
- These two queries are nested within an outer query for concatenating the first and last four stations per state with STATEPROV values of: ca, tx, ny, fl, and il.
- Another outer query with union operators between the queries for each state pulls rows for all forty rows consisting of eight stations per state.
-- list top and bottom 4 weather stations located in -- each of five states with the largest populations select * from ( select top 4 * from noaa_data.[dbo].[ccsl_still_in_use] where STATEPROV = 'ca' ) ca_top union select * from ( select top 4 * from noaa_data.[dbo].[ccsl_still_in_use] where STATEPROV = 'ca' order by STATEPROV, [COOP STATION ID] desc ) ca_bottom union select * from ( select top 4 * from noaa_data.[dbo].[ccsl_still_in_use] where STATEPROV = 'tx' ) tx_top union select * from ( select top 4 * from noaa_data.[dbo].[ccsl_still_in_use] where STATEPROV = 'tx' order by STATEPROV, [COOP STATION ID] desc )tx_bottom union select * from ( select top 4* from noaa_data.[dbo].[ccsl_still_in_use] where STATEPROV = 'ny' ) ny_top union select * from ( select top 4 * from noaa_data.[dbo].[ccsl_still_in_use] where STATEPROV = 'ny' order by STATEPROV, [COOP STATION ID] desc ) ny_bottom union select * from ( select top 4 * from noaa_data.[dbo].[ccsl_still_in_use] where STATEPROV = 'fl' ) fl_top union select * from ( select top 4 * from noaa_data.[dbo].[ccsl_still_in_use] where STATEPROV = 'fl' order by STATEPROV, [COOP STATION ID] desc ) fl_bottom union select * from ( select top 4 * from noaa_data.[dbo].[ccsl_still_in_use] where STATEPROV = 'il' ) il_top union select * from ( select top 4 * from noaa_data.[dbo].[ccsl_still_in_use] where STATEPROV = 'il' order by STATEPROV, [COOP STATION ID] desc ) il_bottom
The following screen shot displays the results set from the preceding script. As you can see, it is just an extract from the full list of stations still in use. There are forty stations in the extract (eight stations for each of five states). The data are arranged in alphabetical order by STATEPROV value and then by COOP STATION ID value within a state. The other columns besides STATEPROV and COOP STATION ID simplify getting some context for the weather stations, such as the state in which a station resides and the latitude and longitude coordinates for stations.
After a set of stations for which to collect data is specified, you can submit the COOP STATION ID values to a series of search forms at the NOAA web site to collect historical weather observations for those stations. Go to the initial weather search form at the NOAA web site to start specifying the a set of historical weather that you seek. After opening the page, you can populate with data for the California weather stations in this tip.
The following screen shot shows the form for the eight California weather stations.
- Daily Summaries are selected for the weather observation type because we seek daily observations derived from the weather stations.
- The requested range of daily observations are from January 1, 2016 through December 31, 2019. This request seeks weather for each calendar day from the start date through the end date for which weather observations are available (recall that weather stations can go out of operation temporarily for multiple reasons).
- Weather stations are specified by a COOP STATION ID values; STATIONS are designated as the search criterion.
- There are eight search terms. These are the eight COOP STATION ID values for the stations in California from the preceding results set. Copy and paste a set of COOP STATION ID values from the preceding results set to populate the search term box.
Clicking Search transfers control from the initial Search form to the Search Results form. Here is what the Search Results form looks like.
- Notice that the last six characters of the Station ID field correspond to a matching value from one of the terms in the preceding form. These terms are derived from the ccsl_still_in_use table.
- Six of the eight weather stations are found in the weather dataset.
- The two weather stations for which no weather could be found have COOP STATION ID values of 049870 and 049872.
- The six weather stations for which weather could be found have COOP STATION ID values of 040010, 040029, 040136, 040144, 049884 and 049888.
- The form appears with the first three of the weather stations selected. These stations have a gray background color for their ADD TO CART button. Selecting the remaining three stations allows you to collect data from all six stations in the NOAA daily summaries weather dataset.
Click the Cart (Free Data) button to advance to the next form; this button appears towards the top right section of the map showing the location of weather stations. An excerpt from the Cart: Daily Summaries form appears in the following screen shot. For the purposes of this tip, you need to make sure the Custom GHCN-Daily Text button is selected. This selection directs the form application to display the weather data in a fixed-width text format on the NOAA web site.
Click Continue to advance to the Custom Options: Daily Summaries form. The Continue button is at the bottom right edge of the Cart: Daily Summaries form, but it is below the form excerpt. The Custom Options: Daily Summaries form permits a user to customize weather elements as well as other items in a request. Other critical data items to select include station name and geographic location in latitude and longitude coordinates.
- Choose the Station Name and Geographic Location check boxes towards the
top of the form.
- The STATION NAME fields in weather dataset and the ccsl_still_in_use table may not match perfectly.
- However, the latitude and longitude values for stations that are the same in the weather dataset and the stations in the ccsl_still_in_use table do match perfectly (within rounding error).
- This tip provides data for demonstrating this issue after downloading the fixed-width text files from a browser. A subsequent tip will verify that stations with different station names and station id values between the two datasets can have identical longitude and latitude coordinates (within rounding error).
- This tip tracks time series weather station values for various weather elements in three weather categories: precipitation, air temperature, and wind. For reasons enumerated in the preceding section, the precise weather element values available from a weather station can vary from one station to another as well as over time for the same weather station.
Clicking Continue on the preceding form page opens a web page from which you can review the weather data selections and submit the request to NOAA for processing.
- The Requested Data Review section from the review form page summarizes the selections made on prior forms.
- You can also use the page for navigating through prior pages to modify or submit your order as is.
- Before clicking the SUBMIT ORDER button, you must enter and verify an email address for the NOAA to send notices about the status of an order (if the form does not remember an email from a prior use).
- The email address that you specify will receive two messages per order.
- The first message is to confirm that processing has commenced.
- The second message is to confirm that an ordered data request is completed. You can download a fixed-width text image of the weather data to your default browser by clicking a link in the second message.
- For a set of eight stations and four years, this process can require one to several minutes to complete.
After the second email is received, the user of the email account can download the file. The following screen shot shows an excerpt from the second email message.
- The heading of the email includes a request number (it is 2198037 in the following screen shot).
- Notice the Download link in the excerpt. Clicking this link opens your default browser to the fixed-width text file in the NOAA web site.
- The View documentation link includes contents describing the field values and coding for the downloaded weather data.
To complete this tip, you need to re-run the forms for the weather stations from each state that you want to include in the dataset for populating a data warehouse or some other application. This means you need to run through the forms for each of the remaining four most populous states, including Florida, Illinois, New York, and Texas. A fresh browser page will be populated at the end of each pass through the forms with a link for downloading the requested weather data.
Importing fixed-width text files into SQL Server
After selecting the default browser's fixed-width text file view generated in response to a NOAA weather data request, you can manually execute Windows copy and paste commands to transfer the image to a NotePad++ session (or another text editor of your choice). From the text editor on your computer, you can save the copied text file to a folder on your computer. By successively applying Windows copy and paste commands for the text file images for each state, you can transfer the text file of weather data for each state to its own text file on your computer. The following screen shot from a NotePad++ session shows an image of the first ten rows of the text file with data for California. Notice that the filename is 2198037.txt; its named after the request number 2198037 for easy record keeping.
Through successive applications of Windows copy and paste commands with a follow-up save command, you can transfer the text file image for each state from your default browser to corresponding text file on your computer. The browser image is of the text file stored on the NOAA web site; the text file from the text editor is for the same weather data on your computer. The following screen shot from a NotePad++ session shows an image of the first ten rows of weather from California. Notice that the filename is 2198037.txt.
- The top panel shows the first six columns from the txt file. These columns,
except for DATE, reference weather station features.
- An especially critical column from the set is STATION, which denotes a comparable value to the COOP STATION ID value in its final six characters.
- The LATITUDE and LONGITUDE columns are also critical in that they reflect the position of the weather station from the weather element dataset, which can be compared to the position of a weather station from the station metadata list. Notice that latitude and longitude coordinate values are represented in a decimal degree format. Because this latitude and longitude format is different in the weather element dataset than the degree-minute-second format from the station metadata list, you need to transform latitude and longitude values to a common format when trying to compare them between the two data sources.
- Each row in the fixed-width text file is for the combination of a distinct date and the weather station designated by the last six characters of the STATION column value.
- The bottom panel shows the last ten columns from the text file. These columns
denote weather element observations for a weather station on a date.
- A column value of -9999 indicates a missing weather element observation.
- Each column's name denotes a different weather element on a date for
a weather station.
- For example, the PRCP value on the third data row denotes four hundredths of an inch of rain. The top panel indicates the third row is for COOP STATION ID 040010 on January 3, 2016.
- You can find the definition of weather element column names at this web page, which you can also get by clicking the View Documentation link in an email from NOAA announcing the availability of the requested weather data.
The following script illustrates an approach for transferring the 2198037.txt file to a SQL Server table named ca_daily_weather_from_station_txt_flds. Recall that the 2198037.txt file contains the data from California weather stations. The text file name is based on the number for a weather request. Each new request for a different state or for a different range of dates will have a distinct request number that is valid for the span of a couple of weeks (after that period, the request number values can be re-used by NOAA for other weather requests).
- The sample script references the noaa_data database. However, you can choose any other database that is convenient for you – just change the argument for the use statement to point at the other database.
- The block of code designated as step 1 creates a fresh copy of the #daily_weather_from_station_txt_flds
- The columns of this table correspond to the unparsed columns values from the fixed-width text file, such as 2198037.txt.
- Because different weather stations do not necessarily support observations for the same weather elements, you may have to adjust the weather element column names for the weather elements from each state.
- In fact, I had to create a separate version of the script below for the weather download for each state. One main reason for the change was to accommodate the distinct list of weather elements and/or the order of the weather elements from each state. Also, you do not have to transfer all columns from the text file to the #daily_weather_from_station_txt_flds table.
- In step 2, the downloaded text file for a state is imported into a fresh
copy of ImportedFileTable.
- This table has a separate line for each line in the fixed-width text file for a state. Therefore, the table needs to be populated by lines of text from a file like 2198037.txt, which is the name of the text file with weather elements from California in this tip.
- A BULK IMPORT statement transfers the lines of text from the text file to the ImportedFileTable SQL Server object.
- The BULK IMPORT statement references c:\noaa_data as the path for the text file, but you can update the code to reflect wherever your text file for weather elements from a set of stations is stored.
- In step 3, the unparsed text data in the ImportedFileTable object is parsed
and inserted into the #daily_weather_from_station_txt_flds temporary table.
- In the sample script below, an insert…select statement transfers parsed values from the select statement to the columns of the #daily_weather_from_station_txt_flds temporary table.
- The select statement in this sample script performs three functions
for all ten weather data elements.
- First, it assigns a value of NULL if the text string is -9999. This text value denotes a missing value in the source data.
- Second, it uses a substring function to extract a range of column values from the textvalue field for successive lines in the ImportedFileTable object.
- Third, while inserting the parsed column values to the #daily_weather_from_station_txt_flds temporary table, the text values are transformed to an appropriate data type for the weather elements. This transformation is often from a string value to a numeric type of real or int.
- Step 4 is the final operation in the script. This step copies via an into clause in a select statement the typed values from the #daily_weather_from_station_txt_flds temporary table into a permanent table named ca_daily_weather_from_station_txt_flds. By transferring the values from a temporary table to a permanent table, the values from the temporary table remain available for easy re-use in a downstream application , such as populating a weather data warehouse, that may run in a different SQL Server session.
use noaa_data go -- for ca stations -- step 1: create a fresh copy of the #daily_weather_from_station_txt_flds table begin try drop table #daily_weather_from_station_txt_flds end try begin catch print '#daily_weather_from_station_txt_flds table not available to drop' end catch go -- for string fields create table #daily_weather_from_station_txt_flds ( noaa_station_id nvarchar(15) ,elevation real ,latitude_deg_dec real ,longitude_deg_dec real ,[date] date ,mdpr real ,dapr real ,prcp real ,snow real ,snwd real ,tmax int ,tmin int ,tobs int ,wdfg int ,wsfg real ) go --------------------------------------------------------------------------------------------------- -- Read weather data from weather stations -- step 2: create and populate a fresh version of ImportedFileTable -- Remove prior ImportedFileTable if EXISTS(SELECT * FROM sys.objects WHERE name = 'ImportedFileTable' and TYPE = 'u') DROP TABLE ImportedFileTable -- Create a fresh version of ImportedFileTable CREATE TABLE ImportedFileTable( textvalue varchar(max) ) GO -- Import text file BULK INSERT ImportedFileTable FROM 'C:\noaa_data\2198037.txt' -- step 3: transfer unparsed text data to parsed columns -- in the#daily_weather_from_station_txt_flds temporary table insert into #daily_weather_from_station_txt_flds ( noaa_station_id ,elevation ,latitude_deg_dec ,longitude_deg_dec ,[date] ,mdpr ,dapr ,prcp ,snwd ,snow ,tmax ,tmin ,tobs ,wdfg ,wsfg ) select substring(textvalue,12,6) [STATON ID] ,substring(textvalue,70,10) elevation ,substring(textvalue,81,10) lat_dec_degree ,substring(textvalue,92,10) long_dec_degree ,substring(textvalue,107,2) + '/' + substring(textvalue,109,2) + '/' + substring(textvalue,103,4) [date] , case when substring(textvalue,112,8) = '-9999' then null else substring(textvalue,112,8) end MDPR , case when substring(textvalue,121,8) = '-9999' then null else substring(textvalue,121,8) end DAPR , case when substring(textvalue,130,8) = '-9999' then null else substring(textvalue,130,8) end PRCP , case when substring(textvalue,139,8) = '-9999' then null else substring(textvalue,139,8) end SNWD , case when substring(textvalue,148,8) = '-9999' then null else substring(textvalue,148,8) end SNOW , case when substring(textvalue,157,8) = '-9999' then null else substring(textvalue,157,8) end TMAX , case when substring(textvalue,166,8) = '-9999' then null else substring(textvalue,166,8) end TMIN , case when substring(textvalue,175,8) = '-9999' then null else substring(textvalue,175,8) end TOBS , case when substring(textvalue,184,8) = '-9999' then null else substring(textvalue,184,8) end WDFG , case when substring(textvalue,193,8) = '-9999' then null else substring(textvalue,193,8) end WSFG from ImportedFileTable where isnumeric(substring(textvalue,12,6))=1 order by [STATON ID],date -- step 4: save a permanent copy of the ca weather station data table for ca begin try drop table ca_daily_weather_from_station_txt_flds end try begin catch print 'ca_daily_weather_from_station_txt_flds table not available to drop' end catch go select * into ca_daily_weather_from_station_txt_flds from #daily_weather_from_station_txt_flds order by noaa_station_id, date
The next screen shot shows the first eight rows of data in the ca_daily_weather_from_station_txt_flds table.
- As you can see, the first row is for a noaa_station_id of 040010 and a date value of 2016-01-01; the eighth row is for 2016-01-08 for the same noaa_station_id value. The noaa_station_id column values represent station id values, such as COOP STATION ID values.
- Except for the noaa_station_id column, the column values are for date, real, and int data types. The noaa_station_id column has a nvarchar data type because COOP STATION ID values are string values (although they are comprised of digits from 0 through 9).
- Notice that the values in the prcp column below exactly match those in the PRCP column from the excerpt of the 2198037.txt file above (see the preceding screen shot).
You can repeat the broad outline of the above application for weather stations from California to the weather stations for the other four states in this tip. The download for this tip includes scripts that illustrate the customization of the broad outline for each of the remaining states. Additionally, the source data text files are also in the download for this tip. For at least one state, I also discovered some uncommon coding for missing values (that is, missing values were denoted by a value other than -9999). As you attempt to retrieve and use NOAA weather data, you should be looking for minor data issues as is common for most large production data sources.
- Source text file: 2200127.txt
- Output table name: fl_daily_weather_from_station_txt_flds
- Source text file: 2200159.txt
- Output table name: il_daily_weather_from_station_txt_flds
- New York
- Source text file: 2202522.txt
- Output table name: ny_daily_weather_from_station_txt_flds
- Source text file: 2202535.txt
- Output table name: tx_daily_weather_from_station_txt_flds
There are three simple steps to gathering weather data for a set of weather stations from a collection of states. This tip uses the five most populous states from the 2010 decennial census as a collection of states from which to gather weather data.
- The first step relies on the ccsl_still_in_use table within this tip. The table contains the NOAA comprehensive list of COOP weather stations that are still in use. For your easy re-use, the table contents are in a csv file named csv_for_all_rows_in_station_metadata__data_source.csv in the download for this tip. This tip's download also contains a subset with forty rows for station metadata for the five most populous states.
- The second step is to use a subset of the stations from within each of the most populous states with the NOAA web-based forms for making weather data available from COOP weather stations. Within the context of this tip, the weather downloads as a fixed-width text file. The text file for each of the five most populous states used in this tip are available in the tip's download. Also, a NOAA document is provided with documentation on the weather element values in the downloaded weather files.
- The third step transfers the weather files to a temporary table of parsed, data typed values and then to a permanent table for subsequent re-use in downstream applications. This tip's download includes a T-SQL script for extracting parsed, data typed values for the fixed-width text file for each of the five most populous states. If you choose, you can use the csv file with station metadata for all COOP stations to select weather from other locales than those covered in this tip. This tip's download includes a T-SQL script file with all the code presented in this tip as well as additional code and data files referenced in the tip.
- Download scripts and data for this article.
About the author
View all my tips
Article Last Updated: 2020-09-01