Learn more about SQL Server tools

mssqltips logo
 

Tutorials          DBA          Dev          BI          Career          Categories          Webcasts          Whitepapers          Today's Tip          Join

Tutorials      DBA      Dev      BI      Categories      Webcasts

DBA    Dev    BI    Categories

 

Importing Historical Stock Prices from Yahoo Finance into SQL Server


By:   |   Read Comments (4)   |   Related Tips: More > Import and Export

New Whitepaper > 10 Ways We Can Steal Your Data >> download now


Problem

Please demonstrate a clean end-to-end solution for importing historical stock ticker prices from Yahoo Finance into a SQL Server table. I need to be able to easily change the date range for which to harvest prices. I want the solution to accommodate at least 150 ticker symbols, and I want to be able to easily change the ticker symbols as well as add new ones.

Solution

Data mining historical stock prices is a very popular topic (as judged by the amount of feedback that I get from tips on this topic). Many trading firms and individual investors believe that they can devise strategies for winning stock trades. However, before you can mine historical stock ticker prices, you need to harvest and store them where you can evaluate stock selection and trading strategies. SQL Server is great at storing stuff, including historical ticker prices, but it is less well suited to interface directly with Google Finance and Yahoo Finance to download historical prices for lots of tickers. This tip demonstrates how to use Python for harvesting historical ticker prices from Yahoo Finance and inserting the collected prices into SQL Server.

Collecting historical stock prices from Google Finance for SQL Server with Python was addressed in this prior tip. It was discovered during the preparation of the tip that a changed Google Finance URL defeated, probably on a temporary basis, Python's ability to modify programmatically date ranges for collecting historical stock prices. Instead of reflecting start and end date parameters, Google Finance always returned data for one year prior to the collection date.

You can also use Yahoo Finance with a Python program as a source for inserting historical stock prices into SQL Server. As of the time this tip is prepared, Yahoo Finance and Python work successfully together so that you can programmatically change the date range over which prices are collected. There are subtle differences in the ways Yahoo Finance and Google Finance return historical prices. As a consequence, you are likely to encounter issues if you do not make some modifications when porting Python code for Google Finance to Yahoo Finance. This tip

  • provides Python code that works around special Yahoo Finance issues
  • presents easy-to-follow T-SQL code for populating a SQL Server table with Python output for historical ticker prices; the prior tip's T-SQL code was not as simple to follow

MSSQLTips.com previously covered a couple of non-Python based approaches to gathering historical stock prices for storage in SQL Server from Google Finance.

  • One approach involved manually inserting a URL in a browser to download historical prices in a csv file from Google Finance for an individual ticker. Then, the downloaded csv files for a set of individual tickers can be imported into SQL Server with SSIS.
  • A second approach depended on manually managing a set of Google Sheets to interface with Google Finance to export csv files for a set of tickers. This approach eliminates the need to manually specify a ticker and date range for each time that you wanted to obtain historical prices for a specific stock because the tickers and date range are saved in a Google Sheet. However, there are still manual steps for publishing sets of csv files for each Google Sheet of tickers. Also, there is a performance impact for adding tickers to a Google Sheet so that the effective limit is about 25 tickers per Google Sheet. This performance impact means that you need an additional sheet for each set of 25 tickers that you want to track.

Both prior approaches involved manual steps that programmers might easily consider tedious, inelegant or both. The approach presented in this tip uses just two scripts. A Python script is used to collect historical ticker prices from Yahoo Finance. A T-SQL script launches the Python script and populates a SQL Server table with the historical prices. Therefore, you can download historical prices for as many different historical stocks as your needs dictate without any manual steps. Just put the stock tickers in a txt file (or modify the stock tickers in a txt file) and run one T-SQL script!

Setting up to use Python with SQL Server

Python is a very popular programming language because of its ease of use and flexibility. Microsoft even enabled the running of Python scripts in SQL Server 2016 and 2017 versions. Also, MSSQLTips.com previously presented a tip that shows how to securely run Python scripts from any version of SQL Server that can run xp_cmdshell extended stored procedures; these versions go back at least until SQL Server 2000. In case you need step by step guidelines for installing Python, you can get these from this tip.

One important reason Python is so powerful is because of the availability of a large set of external libraries that add functionality not provided by the built-in Python library. The pandas_datareader external library facilitates collecting historical prices from both Google Finance and Yahoo Finance. Before you can use an external library, you must

  • install it on a computer
  • reference it from a Python script

The process for installing and referencing the pandas_datareader library is described in this tip. The same general process applies to installing and referencing other Python built-in and external libraries - some of which are referenced in the current tip.

A Python script for collecting historical prices for over a hundred tickers from Yahoo Finance

The Python script for downloading historical stock prices from Yahoo Finance relies on a txt file with a list of stock ticker symbols. As you may know, a stock ticker symbol (sometimes just called a ticker) is a short-hand way for referencing the name of a security. The download files for this tip include two txt files of tickers.

  • long_stock_symbol_list.txt has 150 ticker symbols. In general, the tickers were selected because the prices for the stocks showed some tendency for appreciation within the months leading up to the time this tip was prepared.
  • short_stock_symbol_list.txt has 20 ticker symbols within it. All its tickers were derived from the preceding file. You may find this shorter list of tickers useful for quickly unit testing changes to the Python code and its interface with the T-SQL code. I found the shorter file of stock tickers helpful for these purposes.

The following Python code listing from the read_long_symbol_list_from_2007_and_download_stock_prices_with_try_catch_any.py file shows the script for downloading historical stock prices for the long list of tickers. Lines starting with # denote comment lines. Comments mainly serve two purposes.

  • They offer commentary on portions of the script immediately following the comment lines.
  • They illustrate how to modify the script to use the short ticker list and a shorter date range for historical prices that permit faster unit testing of code changes.

Here's a detailed walk-through of the code segments within the following script. It is meant to help you understand every element of the script.

  • The three code lines at the top of the script enable two Python settings for controlling how the print statement operates within the script. The print statement generates output that is ultimately consumed by SQL Server.
    • Both option settings depend on the pandas external library for Python. Recall that the Setting up to use Python with SQL Server section references a tip for installing external libraries, such as pandas. The pandas library is a different library than the pandas_datareader library.
      • After a library is installed, you need to import it into each script in which you want to use it.
      • The pd at the end of the import line for the pandas library designates an alias name within the script for the library.
    • The display.max_rows option setting controls the number of rows that display when you print the rows of a dataframe. A dataframe in Python code is roughly analogous to a table in T-SQL code.
      • By default, a dataframe shows 60 rows of output. If there are more than 60 rows of output, Python displays 30 rows from the beginning of the dataframe and 30 rows from the end of the dataframe with a separator between the two ends.
      • The specification of None for the display.max_rows option setting results in a print statement showing all rows for a dataframe. In the context of the current demonstration, this setting causes the output from the script to display available historical stock prices for all tickers over the date range from the start date through the end date -- even when the date range extends over 10 years!
    • The expand_frame_repr option setting can enable the print command to display all column values for a dataframe in a single, undivided set of columns. The default display mode is to show column values in multiple sets of columns (one set after the other set) if the width of the columns exceed the implicit width of a page.
      • It is easier to consume the Python output within SQL Server when all column values for a dataframe appear as single, undivided set of columns.
      • Assigning a value of False to the expand_frame_repr option setting enables the column values to display as a single, undivided set of columns.
  • The next block of three code lines references three more libraries. Two libraries are internal libraries and one is external.
    • The datetime and date libraries are internal Python libraries for representing date and time values. Notice that there are different syntax examples for importing the datetime library versus the date library.
      • The basic way to reference a library for use in a Python script is with an import statement; the datetime library is referenced this way.
      • When one library is nested within another (as is the case for the date library nested in the datetime library), the correct syntax for referencing the nested library is to use a from keyword followed by the name of the nesting library, which is, in turn, followed by the import keyword followed by the name of the nested library.
    • The pandas_datareader library is an external library; this library is for remote data access to a set of internet sources, including Yahoo Finance and Google Finance. The alias web is assigned to the data class within the pandas_datareader library. The script below demonstrates the basics along with error trapping for usage of the library.
  • The next code block demonstrates the use of a Python list object named symbol for storing ticker values read from the long_stock_symbol_list.txt file. If you need historical prices for more or different stocks than the 150 tickers in this file, just add new lines or replace existing lines in the file.
    • An open statement opens the txt file.
    • Then, a for loop iterates through each line in the file.
    • Within each iteration, a strip function peels off the carriage return code trailing the stock ticker designation before insertion into the list object.
  • After the tickers are read from a file into a list object, another code block includes expressions for specifying the start and end dates through which to collect historical stock prices.
    • The expression for the start date value assigns a starting date of January 1, 2007 as a Gregorian date constant in year, month, day format to the date class. The starting date expression designates a date that is more than 10 years prior to the preparation date for this tip (around the end of October 2017).
    • The expression within the line commencing with #start illustrates how to use a different formulation for setting the start date for testing purposes.
      • To replace the uncommented line commencing with start by the line commencing with #start, just move the leading # sign from one line to the other.
      • The today method for the date class returns the current local date.
      • If you do trade which line is commented, the number of historical prices will be just five or less per ticker value.
        • date.today().year returns the current local year
        • date.today().month returns the current local month
        • date.today().day-5 returns the current local day up to 5 trading days from today
      • The reason the alternate start date can yield less than five historical prices is because the today method returns calendar days, but stock markets do not operate on all calendar days. Weekend days as well as selected holidays are days on which stock exchanges in the United States do not trade stocks. Therefore, if the value of date.today().day-5 falls on a weekend day or a stock exchange holiday, then there is no historical stock price for that day from the exchange.
  • The final code block in the script below uses a while loop to iterate through all the ticker values in the symbol list, which is based on ticker values from the long_stock_symbol_list.txt file.
    • Ignoring the try, except, and continue statements within the while block, there are two types of paths through the while code block.
      • One type of path is for an invocation of the datareader method of the data class within the pandas_datareader library that does not throw an exception. This is the path that the code follows when it returns historical prices for a stock ticker.
      • A second type of path is for an invocation of the datareader method of the data class within the pandas_datareader library that does throw an exception.
      • In unit testing the code sample, I discovered several different types of exception errors that caused the code within the loop to abort. The most common exception was for a RemoteDataError, but this was not the only type of exception.
    • The try, except, and continue statements below catch exceptions of any type and cause the code to operate again (continue) until a no-exception run through the while loop is the result for a ticker symbol.
      • In this way, the code returns historical prices for all legitimate ticker values in the long_stock_symbol_list.txt file.
      • The except statement traps any type of exception error because there is no reference to a specific exception, such as a RemoteDataError exception.
      • When an exception is trapped for the pass through the loop, the code denotes the occurrence of an exception path through the loop by inserting in the output "No information for ticker:" followed symbol[i] value.
      • In scores of unit tests for the code, I never encountered an exception for a legitimate ticker that did not get resolved automatically after a relatively short number of re-runs, such as 1, 2, or 3 passes, through the while loop.
      • Here's a warning: an illegitimate ticker in the symbol list will cause the Python code to loop without end. Therefore, you should double check all ticker values before using them in a production environment.
    • The datareader method of the data class within the pandas_datareader library returns a Python dataframe object named f in the script. When using the datareader method to return historical prices, the method takes four arguments.
      • The first argument is for a ticker value.
      • The second argument is yahoo for Yahoo Finance or google for Google Finance.
      • The third argument is for the first calendar date for which a set of historical prices are returned; if an exchange is closed on that day, then historical prices start on the first trading day after the start date.
      • The fourth argument is for the last calendar date through which historical prices are returned. If the end date is for a day on which an exchange is closed, then historical prices are returned through the last trading day before the end date.
    • The f dataframe returned by the datareader method is manipulated by both insert and drop methods.
      • Within the script, the insert method is used to add a new column to the dataframe in the first position after the date column.
        • The dataframe's date column is an index column for the dataframe.
        • The inserted column has the name Symbol.
        • The value for the inserted column is the value of symbol[i] within the current pass through while loop.
      • The drop method can be used to drop either a column or a row from a dataframe; the axis parameter with a value of 1 directs the method to drop a column from a dataframe.
        • The name of the column to be dropped by the method in the script below is Adj Close.
        • One goal of the script below is to return historical stock prices that match the format of those from Google Finance; this goal dictates the dropping of the Adj Close column because this column is not reported for historical stock prices by Google Finance.
    • The print statement for the f dataframe writes the values from the modified dataframe. After the date and symbol columns, the column names are for open, high, low, and close prices as well as volume (number of shares traded).
  • The final statement in the while loop increments the value of the i index by 1. When this index value exceeds the number of tickers in the symbol list, iterations through the loop are terminated and the script ends.
#settings to dynamically display all rows and columns
#in a single tabular display of the dataframe
import pandas as pd
pd.set_option('display.max_rows', None)
pd.set_option('expand_frame_repr', False)

#settings for importing built-in datetime and date libraries
#and external pandas_datareader libraries
import time
import datetime
from datetime import date
import pandas_datareader.data as web

#read symbols from file to python list
#optionally, read from short_stock_symbol_list.txt for testing
symbol = []
with open('C:\python_programs\long_stock_symbol_list.txt') as f:  
    for line in f:
        symbol.append(line.strip())
f.close

#setting for start and end dates
#you can designate a specific date
#such as 2007, 1, 1
#or a relative date such as today less 5 days

start = datetime.date(2007, 1, 1)
#start = date(date.today().year, date.today().month, date.today().day-5)

end = datetime.date.today()

#iterate over ticker symbol values
#with while for retrieving historical and
#displaying stock prices
#try code and catch any exception
#try again (continue) after catch
i=0
while i<len(symbol):
    try:
        f = web.DataReader(symbol[i], 'yahoo', start, end)
        f.insert(0,'Symbol',symbol[i])
        f = f.drop(['Adj Close'], axis=1)
        print (f)
    except :
        print("No information for ticker:" )
        print (symbol[i])
        continue
    i=i+1

Selected historical prices from the Python script

The following screen shot shows an excerpt from the Python script output with historical prices for the JOBS ticker above corresponding historical prices from the Yahoo Finance user interface. For your easy reference, you can find basic Yahoo Finance user interface information for the JOBS ticker at this url. Use the Time Period calendar on the web page to adjust the date range as you prefer. If you change the date range do not forget to click the Apply button to revise the display for your changed date range setting. There are a couple of points that you can learn about the way the Python script works by carefully examining these two excerpts.

  • Notice that there is a message on two lines towards the top of the Python output. The first line reads: "No information for ticker:", and the second line reads: "JOBS".
    • Recall that this message means an attempt to get historical prices for the JOBS ticker failed.
    • This can be for a reason as simple as Yahoo Finance not being ready to provide JOBS historical prices at the time of the request or another conflicting process blocking the fulfillment of the request for JOBS historical prices
    • The output confirms that the script tried again; and the next attempt was successful at getting historical prices.
    • Recall that the re-try is automatic based on the code in the Python script.
  • The historical prices from both the Python script and the Yahoo Finance user interface are for dates from January 3, 2007 through January 16, 2007.
    • The Python script output starts at January 3, 2007 at the top, but the Yahoo Finance user interface output ends with January 3, 2007 at the bottom.
    • In addition, the user interface historical prices include values in left-to-right order for Open, High, Low, Close, and Adj Close prices. In contrast, the Python script output omits the Adj Close prices because the Python script drops the Adj Close column returned to its dataframe from Yahoo Finance.
    • Aside from these two differences, the historical prices (and even the volumes) are the same from the Python script versus the Yahoo Finance user interface.
    • The matching values confirm that although the initial attempt to get historical prices for the JOBS ticker failed, the second try returned historical ticker prices and those values matched corresponding results from the Yahoo Finance user interface.
Inserting_Historical_prices_from_Yahoo_fig_1

The top part of the next screen shot shows several of the most recent historical prices for the JOBS ticker as of the time the Python script was run followed by the earliest three historical prices for the AAOI ticker. The AAOI ticker follows the JOBS ticker in the long_stock_symbol_list.txt file. Therefore, this screen shot reflects the output during the transition from the first ticker to the second ticker.

The bottom part of the next screen shot shows the most recent JOBS historical prices from the Yahoo Finance user interface. Just below the next screen shot is another excerpt from the Yahoo Finance user interface. This screen shot shows the three earliest historical prices for the AAOI ticker. There are a several points that you can learn about the Python script from studying the following two screen shots.
  • As with the previous screen shot, the most recent JOBS historical prices from the Python script exactly match those from the Yahoo Finance user interface when you compare prices by date. The initial exception error reported by the Python script for getting JOBS historical prices did not have an impact on the return of valid prices in the second try that match the prices from the Yahoo Finance interface.
  • There are two heading rows in the Python script output that separate the end of the JOBS historical prices from the start of the AAOI historical prices.
    • The first row contains column headings from Symbol through Volume.
    • The second row contains Date on the left border for the index column above the AAOI dataframe values.
  • Please note that the dates for the earliest three AAOI prices run from September 26, 2013 through September 30, 2013. September 28 and 29 in 2013 are weekend days. The initial public offering date of AAOI shares was September 26, 2013, which is well after the start date in the Python script code (January 1, 2007). Python does not assign NULL values for dates without historical prices.
  • By comparing the last three rows of the next screen shot to the three rows in the screen shot after the next one, you can verify Open through Close AAOI historical prices are the same in the Python script output and the Yahoo Finance user interface.
Inserting_Historical_prices_from_Yahoo_fig_2
Inserting_Historical_prices_from_Yahoo_fig_3

The final screen shot for Python output shows the first eight sets of ENTG historical prices. The recovery of historical prices for this ticker on this run of the script took four tries in order to obtain results from the script that matched the Yahoo Finance user interface. The first three tries failed and returned a pair of "no information" message lines for each failure before the fourth try succeeded. However, the output from the fourth try returned historical prices that matched perfectly those in the Yahoo Finance user interface.

As you can see, the automatic re-try feature in the Python script is a very useful feature for consistently obtaining historical prices from the script that match those from the Yahoo Finance user interface. I tested the script many times, and the script does not always fail 1, 2, or 3 times for the same tickers on each test. Sometimes a ticker immediately returns valid results, but for other test script runs the same ticker may require one or several tries. No matter if or when a failure occurs, the Python code provides a reliable way of obtaining historical prices from Yahoo Finance.

Inserting_Historical_prices_from_Yahoo_fig_4

T-SQL script for consuming Python script output and inserting it in a SQL Server table

There are two steps for inserting the output from the Python script into a SQL Server table.

  • First, you can run the Python script with T-SQL code in SQL Server Management Studio and capture the output from the Python script in a staging table within SQL Server.
  • Second, you can parse the date, ticker symbol, as well as open, high, low, and close prices from each line containing these values in the staging table and save the extracted values in a second SQL Server table with one column for each distinct kind of value.
This section presents the T-SQL code to perform each of these steps. The completed solution is available from the following file: extracted and casted prices starting in 2007 from long symbol list from yahoo.sql.

Up until this point in the tip, the Python script output was viewed in a shell created by the IDLE application that ships with Python 3.6. However, Python script files can also be run from a Windows cmd shell. When this happens, the Python script returns its output to the Windows cmd shell.

You can open a Windows cmd shell with T-SQL from SQL Server Management Studio and invoke a Python script file. In turn, you can transfer the Python script output from the Windows cmd shell to a staging table in SQL Server. The Setting up to use Python with SQL Server section gives links for several prior tips that

  • help you to install Python for use with SQL Server
  • provide guidelines about good security practices for running Python scripts with the xp_cmdshell extended stored procedure in SQL Server
This next T-SQL segment assumes a basic familiarity with how to invoke Python scripts as described in the links referenced within the Setting up to use Python with SQL Server section. I urge you to review these links unless you are already comfortable running Python scripts from within SQL Server Management Studio.

The code segment starts by configuring SQL Server so that it can run the xp_cmdshell extended stored procedure. By default, SQL Server disables this feature; it is a good security practice to leave the feature disabled except precisely when you need to take advantage of it.

Next, the code segment creates a fresh instance of a temporary table name #Result. This table stores the raw output from the Python script file for historical ticker prices from Yahoo Finance. Notice there is just one column in the #Result table named line. This column has a varchar data type. Another script excerpt focusing on the extraction and casting of Python dataframe column values to SQL Server table column values is described later in this tip.

After the #Result staging table is created, the xp_cmdshell extended stored procedure is used to invoke indirectly the Python script file for gathering historical prices by running a Windows batch file that invokes the Python script file. This file contains the script for creating the raw Python output with historical ticker prices. Selected excerpts from this output are displayed in the preceding section. The output from the xp_cmdshell extended stored procedure is, in turn, stored in the #Result temporary table.

The final step re-configures SQL Server back to its default state for the xp_cmdshell extended stored procedure. This final step disables the capability to invoke a xp_cmdshell extended stored procedure from within SQL Server Management Studio.

-- for use by Python dev team members

-- enable the xp_cmdshell stored procedure
EXEC sp_configure 'xp_cmdshell', 1
RECONFIGURE

--------------------------------------------------------------------------------------

-- create a fresh temp table (#Result) for storing unprocessed result set from 
-- running xp_cmdshell for Windows batch that invokes
-- python script
if exists (
 select * from tempdb.dbo.sysobjects o
 where o.xtype in ('U') 
 and o.id = object_id(N'tempdb..#Result')
)
BEGIN
  DROP TABLE #Result; 
END

CREATE TABLE #Result
(
  line varchar(500)
)

--------------------------------------------------------------------------------------

-- insert python script output into #Result temp table
INSERT #Result exec xp_cmdshell 'C:\python_programs\"read_long_symbol_list_from_2007_and_download_stock_prices_with_try_catch_any.bat"'

--------------------------------------------------------------------------------------

-- disable the xp_cmdshell stored procedure
EXEC sp_configure 'xp_cmdshell', 0
RECONFIGURE
GO

By running a simple T-SQL statement (select * from #Result), you can return all rows from the #Result staging table. The #Result table was populated after Yahoo Finance finally updated its historical prices for October 27, 2017. You can properly compare results from this section to those from the preceding section for the read_long_symbol_list_from_2007_and_download_stock_prices_with_try_catch_any.py file by understanding that the prior section's screen shots were taken prior to Yahoo Finance updating historical prices for October 27, 2017.

The following screen shot shows the first seventeen rows from #Result staging table.

  • The first four rows contain output from the Windows shell environment that are not part of the Python script output.
    • Rows 2 and 4 contain the two commands in the read_long_symbol_list_from_2007_and_download_stock_prices_with_try_catch_any.bat file. As you can see, these commands change the directory to one used for storing Python system files (C:\Program Files\Python36\) and then uses python.exe to invoke the read_long_symbol_list_from_2007_and_download_stock_prices_with_try_catch_any.py file, which resides in another directory (C:\python_programs\).
    • Rows 1 and 3 are Windows content that gets returned as NULL values by the xp_cmdshell extended stored procedure.
  • Just as with the run from the prior section, the attempt to recover historical prices for the JOBS ticker failed on its first try. However, a second try succeeded.
  • Also, the historical price values in this section are identical from January 3, 2007 through January 16, 2007 to the historical price values in the prior section.
Inserting_Historical_prices_from_Yahoo_fig_5

The next screen shot shows the historical prices from January 3, 2007 through January 12, 2007 for the ENTG ticker from the #Result table. This screen starts by showing the most recent two historical values for the EDN prices followed by the column header for non-index columns and the index column header. The EDN ticker is the one before the ENTG ticker in the long_stock_symbol_list.txt file.

  • The most obvious difference for the ENTG results in this section from the preceding section is that the attempt to get ENTG historical prices succeeded on the first try for populating the #Result table; recall that the first three attempts from the preceding section failed before succeeding on a fourth try. This outcome demonstrates that failed attempts to recover historical prices by ticker are not consistent across different runs of a Python script.
  • However, the historical prices in this section are identical to those in the preceding section. Therefore, this outcome also confirms again that a failed earlier try to recover historical prices for a ticker does not impact the validity of a successful attempt to recover historical prices.
Inserting_Historical_prices_from_Yahoo_fig_6

The final screen shot showing #Result staging table values shows the last several rows of the table.

  • The very last row is for Windows content that the xp_cmdshell extended stored procedure returns as a NULL value; this last row does not include any missing or corrupted historical prices.
  • The other rows in the screen shot display contents for the YY ticker, which is the final ticker in the long_stock_symbol_list.txt file.
  • Notice that final date is for October 27, 2017; recall that this screen shot is for results taken after Yahoo Finance finalized historical prices for October 27, 2017. The screen shots from the preceding section were taken before the finalization of historical prices for October 27, 2017. Therefore, the last row of historical prices for a ticker from the preceding section is October 26, 2017.
Inserting_Historical_prices_from_Yahoo_fig_7

If you review again the preceding screen shots in this section, you can easily confirm historical prices are only present for rows starting with a date that also include a ticker value. Therefore, we can use these criteria of (1) rows starting with a date and (2) rows with a symbol after the date to extract and process only rows with historical prices from the #Result table.

In addition to excluding rows that do not start with a date followed by a ticker symbol, it was also necessary to filter out around 100 more rows out of over 350,000 total rows from the #Result table. These additional omitted rows were for rows with NaN values for all historical prices. An NaN value is the way Yahoo Finance (and Google Finance) denotes missing data. Visual examination of rows with an NaN value indicated that if a row had one NaN value for a historical price, then all the historical prices on the row had NaN values. Rows omitted because they contained NaN values for historical prices were for just four tickers out of the 150 tickers for which data was collected. The tickers with some omitted rows because of NaN values were: CAI, CC, MOMO, and SINA. It may be worth pointing out that there were plenty of valid historical prices for these four tickers.

The following T-SQL code excerpt converts #Result table rows with all the values in one string per row to another table with separate columns for date, ticker value, as well as open, high, low, and close historical prices. The new table with separate columns has the name #Results_with_extracted_casted_values. As the new table name implies its values are in separate columns and they are cast to valid data type values.

  • The date value is represented with a date data type.
  • The ticker is represented with a varchar data type.
  • The historical prices are represented with a money data type.

After removing any prior version of the #Results_with_extracted_casted_values table, the code excerpt runs a set of nested queries to populate via the into keyword the #Results_with_extracted_casted_values table. The code excerpt contains an outer query and five nested sub-queries.

  • The outer query assigns casted historical price column values to the #Results_with_extracted_casted_values table. The outer most query is described in more detail after reviewing all the nested sub-queries
  • The five nested sub-queries progressively extract or set-up to extract separate column values from the string value on each row of the #Result table. The nested sub-query names from the inner most to the outer most are:
    • extract_date_and_symbol
    • params_for_open
    • params_for_high
    • params_for_low
    • params_for_close
  • The extract_date_and_symbol sub-query performs these functions
    • It filters rows so that only rows meeting all three criteria for valid historical prices are included; these criteria are
      • The first 10 characters of the row must be a date
      • There must be a string value on the line starting with values from A through Z
      • The line must not contain a NaN value
    • There are two expressions in a select list for extracting the date and ticker values from each #Result row with valid historical prices
    • Additionally, the select list in the extract_date_and_symbol sub-query also references a column of line values from the #Result table
  • The params_for_open sub-query performs these functions
    • It preserves the date and ticker values derived by the extract_date_and_symbol sub-query
    • Then, it uses string expressions for adding two more columns to its result set
      • One expression named open_1 finds the row position in the line column value for the first position of the open historical price
      • The second expression named open_plus_1 finds the row position in the line column value for the first blank after the open historical price on a row
    • Additionally, the select list in the params_for_open sub-query also includes a column of line values from the extract_date_and_symbol sub-query result set
  • The params_for_high sub-query performs these functions
    • It preserves the date and ticker values derived from the params_for_open sub-query
    • It uses a substring function in an expression based on line, open_1, and open_plus_1 column values to extract a value for the open historical price on a row
    • Then, it uses string expressions for adding two more columns to its result set
      • One expression named high_1 finds the row position in the line column value for the first position of the high historical price
      • The second expression named high_plus_1 finds the row position in the line column value for the first blank after the high historical price on a row
    • Additionally, the select list in the params_for_high sub-query also references a column of line values from the params_for_open sub-query result set
  • The params_for_low sub-query performs these functions
    • It preserves the date, ticker, and open values derived from the params_for_high sub-query result set
    • It uses the substring function in an expression based on line, high_1, and high_plus_1 column values to extract a value for the high historical price on a row
    • Then, it uses string expressions for adding two more columns to its result set
      • One expression named low_1 finds the row position in the line column value for the first position of the low historical price
      • The second expression named low_plus_1 finds the row position in the line column value for the first blank after the low historical price on a row
    • Additionally, the select list in the params_for_low sub-query also references a column of line values from the params_for_high sub-query result set
  • The params_for_close sub-query performs these functions
    • It preserves the date, ticker, open, and high values derived from the params_for_low sub-query result set
    • It uses the substring function in an expression based on line, low_1, and low_plus_1 column values to extract a value for the low historical price on a row
    • Then, it uses string expressions for adding two more columns to its result set
      • One expression named close_1 finds the row position in the line column value for the first position of the close historical price
      • The second expression named close_plus_1 finds the row position in the line column value for the first blank after the close historical price on a row
    • Additionally, the select list in the params_for_close sub-query also references a column of line values from the params_for_low sub-query result set
  • The outer most query references the params_for_close sub-query result set in its from clause
    • It preserves the date and ticker values from the params_for_close sub-query result set
    • Next it casts open, high, and low historical price values from the params_for_close sub-query result set as a money data type
    • Then, it uses a substring function based on line, close_1, and close_plus_1 column values to extract a close historical price; the substring function is nested in a cast statement to convert the close historical price string value to a money data type value
    • Finally, the into clause populates the #Results_with_extracted_casted_values table with the result set from the outer most query
-- extract and cast stock open, high, low, close prices 
-- from long symbol list starting in 2007 to present
begin try
drop table #Results_with_extracted_casted_values
end try
begin catch
print '#Results_with_extracted_casted_values was not deleted'
end catch

-- extract close and add to date, symbol, open, high, low
-- cast open, high, low, close as money

select
 date
,symbol
,cast([open] as money) [open]
,cast(high as money) high
,cast(low as money) low
,cast(substring(line, close_1, close_plus_1 - close_1) as money) [close]
into #Results_with_extracted_casted_values
from
(
-- extract low and add to date, symbol, open, high
-- get parameters for extracting close
select
 date
,symbol
,[open]
,high
,substring(line,low_1,low_plus_1 - low_1) low

,low_plus_1
 +
 patindex('%[0-9]%',substring(line,low_plus_1,500))
 -1
 close_1
,charindex
 (
 ' '
 ,line
 ,low_plus_1
 +
 patindex('%[0-9]%',substring(line,low_plus_1,500))
 -1
 ) close_plus_1
,low_plus_1

,line

from
(
-- extract high and add to date, symbol, open
-- get parameters for extracting low
select
 date
,symbol
,[open]
,substring(line,high_1,high_plus_1 - high_1) high
,
 high_plus_1
 +
 patindex('%[0-9]%',substring(line,high_plus_1,500))
 -1
 low_1
,
 charindex
 (
 ' '
 ,line
 ,high_plus_1
 +
 patindex('%[0-9]%',substring(line,high_plus_1,500))
 -1
 ) low_plus_1
,line
from
(
-- extract open and add to date, symbol
-- get parameters for extracting high
select
 date
,symbol
,substring(line,open_1,open_plus_1-open_1) [open]
,
 open_plus_1
 +
 patindex('%[0-9]%',substring(line,open_plus_1,500))
 -1
 high_1
,charindex
 (
 ' '
 ,line
 ,
 open_plus_1
 +
 patindex('%[0-9]%',substring(line,open_plus_1,500))
 -1
 ) high_plus_1
,line
from
(
-- get parameters for extracting open
select 
 date
,symbol
,charindex(' ',line, patindex('%[A-Z]%',line))
  + 
  patindex('%[0-9]%',substring(line,charindex(' ',line, patindex('%[A-Z]%',line)),500)) 
  -1 [open_1]
,
 charindex
 (
 ' '
 ,line
 ,
 (
 charindex(' ',line, patindex('%[A-Z]%',line))
  + 
  patindex('%[0-9]%',substring(line,charindex(' ',line, patindex('%[A-Z]%',line)),500)) 
  -1 )
 ) open_plus_1
,line

from
(
-- get date and symbol for returned data
select 
 cast(substring(line,1,10) as date) [date]
,cast(substring(line,patindex('%[A-Z]%',line),charindex(' ',line, patindex('%[A-Z]%',line)) - patindex('%[A-Z]%',line)) as varchar(10)) [symbol]
,line
from #Result
where 
isdate(substring(line,1,10)) = 1   -- line must start with a date
and patindex('%[A-Z]%',line) > 0   -- line must have a symbol value
and line not like '%Nan%'          -- no NaN line value
) extract_date_and_symbol
) params_for_open
) params_for_high
) params_for_low
) params_for_close

order by symbol, date


The following screen shot shows historical prices from the #Results_with_extracted_casted_values table for JOBS ticker trading days from January 3, 2007 through January 16, 2007. The key point to note is that each kind of value from date through close price is in its own separate column. The results below match those from the source Python script displayed in the Selected historical prices from the Python script section. The money data type can have up to 4 digits after the decimal point.

Inserting_Historical_prices_from_Yahoo_fig_8

Similarly, the following historical prices for the YY ticker show results from the #Results_with_extracted_casted_values table for the YY ticker for trading days from October 19, 2017 through October 27, 2017 perfectly match those from the raw #Result table values. This confirms that the extraction and casting places values in separate columns without changing the values initially retrieved.

Inserting_Historical_prices_from_Yahoo_fig_9

Next Steps

  • Install Python on your computer if it is not already installed. See this tip for detailed instructions on how to install Python in a Windows environment for use by all users.
  • Next, install the pandas_datareader external reader library to your Python installation. See this tip for detailed instructions about installing the pandas_datareader external library for use by all Windows users on a computer. Critically, this tip also describes how to configure SQL Server accounts and a credential so that the xp_cmdshell extended stored procedure can be run without elevated permissions. This same tip also discusses how to configure a Windows standard account for a SQL Server credential to deny selected permissions for the account, such as denying the permission to delete files in a folder.
  • Next, download the files for this tip into a c:\python_programs directory on the computer running SQL Server.
  • Then, run the T-SQL script in the extracted and casted prices starting in 2007 from long symbol list from yahoo.sql file from the c:\python_programs directory. It should generate a result set with historical prices for 150 ticker symbols. The historical prices for tickers will start as early as the first trading day in October 2007 through the trading day with the most recent historical prices.
  • As mentioned earlier in this tip, it is easy to add or change the ticker values for which historical values are harvested from Yahoo Finance. If you have a couple of stock tickers you want included beyond the 150 starter ticker symbols add them to the long_stock_symbol_list.txt file. Then, verify that your newly added tickers are among those for which historical prices are reported. Sample testing scripting code is included towards the bottom of the extracted and casted prices starting in 2007 from long symbol list from yahoo.sql file in three simple script statements.
  • If you like string parsing, then try to improve the T-SQL for extracting and data typing columns values from the raw output in the #Result table to the date, ticker, and open through close historical prices in the #Results_with_extracted_casted_values table. For example, you may want to include Volume among the values from Yahoo Finance that are extracted and casted. If you need any clarifications about how the current extracting and data typing code works beyond what is in this tip, please leave a request for a clarification in a comment on this tip.


Last Update:


signup button

next tip button



About the author
MSSQLTips author Rick Dobson Rick Dobson is a Microsoft Certified Technical Specialist and well accomplished SQL Server and Access author.

View all my tips





Post a comment or let the author know this tip helped.

All comments are reviewed, so stay on subject or we may delete your comment. Note: your email address is not published. Required fields are marked with an asterisk (*).

*Name    *Email    Notify for updates 


SQL tips:

*Enter Code refresh code     



Saturday, November 18, 2017 - 10:46:55 AM - Chris Back To Top
Tip Comments Pending Approval

Saturday, November 18, 2017 - 6:04:38 AM - Chris Back To Top

 Hello Rick,

Thanks for your reply and interest in my product.

I am creating a powerpoint slides to show you the complete procedure. How my datafetcher works.

I'll get back to you soon with the details.

Regards

 


Friday, November 17, 2017 - 11:17:36 AM - Rick Dobson Back To Top

 Hey Chris,

Thanks for the feedback.

I have several more extended tips on this topic that will appear soon in MSSQLTips.com.

If you blogged or written about your work about 100,000+ symbol work, I would very much like to get to see it.  Please reply with references to the development strategies and/or publicly available databases with historical prices for tickers.

Rick Dobson

 


Friday, November 17, 2017 - 5:22:52 AM - chris Back To Top

Hello.

 

Thanks for sharing complete and detailed method for python. We have done same thing in codeigniter. Which is currently working over 100,00 symbols and fetching historical data directly to our DB. If i can help someone with it. I would be lucky.

 

Thanks again Great Work. Highly Appreciated  

 


Learn more about SQL Server tools