Import Historical Stock Prices into SQL Server using Python
A prior tip demonstrated a highly secure way to extract historical stock prices for a single ticker symbol programmatically with Python from Google Finance for use inside SQL Server. However, I need the prior tip's scope expanded to perform the same task for a batch of different ticker symbols. Additionally, I need a ticker symbol added to the result set inserted into SQL Server to identify to which ticker symbol historical prices belong. Also, please provide step-by-step guidance and specific T-SQL code on how to parse strings with irregularly spaced values from Python text output into discrete columnar values within SQL Server.
Data mining stock price data and evaluating trading strategies benefit from having data available for many different stock ticker symbols. An extensive set of historical prices for many different stock ticker symbols also facilitates the application of advanced analytics for picking stocks that are likely to show favorable price movements. You can think of a ticker symbol as a short nickname for a stock. This tip presents a framework for developing a database table with historical stock prices within SQL Server for analysis based on data available from Google Finance.
This tip demonstrates an integrated process to fully automate the creation and population of a SQL Server table with historical stock prices for 150 different ticker symbols; the demonstration can be readily repeated for different batches of ticker symbols and historical ranges. Also, the tip takes advantage of Python -- a popular scripting tool for which multiple external code libraries enable the execution of many web-related functions. For example, built-in functions within the pandas_datareader external library facilitate automating the retrieval of historical stock prices from popular repositories, such as Google Finance and Yahoo Finance.
Change report from prior tip
In a prior tip, a very simple Python script was presented to extract historical prices from Google Finance for the nvda stock symbol. As you can see below, the prices are to start (depending on data availability) on January 1, 2017 and run through August 31, 2017. Research for the prior tip confirmed the successful operation of the script for several different sets of start and end dates.
import pandas_datareader.data as web import datetime start = datetime.datetime(2017, 1, 1) end = datetime.datetime(2017, 8, 31) f = web.DataReader("nvda", 'google', start, end) f.ix print(f)
During the week of starting September 10, 2017, the start and end parameter values switched from being recognized by the Google Finance site to not being recognized by the site.
- After the switch, Google Finance returned to the Python script data for the past year no matter what values were assigned to start and end parameters in the script.
- I initially noticed this switching during my testing of code for this tip.
- Additionally, another developer also reported the same issue in a post to stackoverflow.com during the week starting September 17, 2017.
- Later during the week in September 17, 2017, another developer issued a post indicating that Google Finance made a change to an access url that resulted in dates not being passed to Google Finance from the pandas datareader.
- It may take several weeks to months from late September 2017 for the pandas datareader issue with Google Finance to be resolved so that pandas datareader can successfully pass start and end dates to Google Finance.
This kind of issue occurs periodically in the Python-financial data development community. It is easy to understand why this is so because of the independent nature of contributors to the community. For example, the simple script above depends on Python and pandas_datareader developers as well as the Google Finance Web site. If any of these contributors makes a change that is not coordinated with either of the other two, then the outcome from the sample script can change. A similar issue was reported and eventually resolved at multiple levels for getting the pandas datareader to interface properly with Yahoo Finance. The initial patch fix for Yahoo Finance took about one week to appear on the stackoverflow.com site from the initial report of the issue. As you can see from the script referencing Yahoo Finance below, Yahoo Finance currently processes correctly start and end parameter values.
The issue for the Python script reported in the prior MSSQLTips.com tip does not affect the validity of the data or even the availability of recent data (up to 1 year ago). So long as you can benefit from data within the past year, Google Finance provides valid historical stock prices. I confirmed this by running the Python script from the prior MSSQLTips.com tip with both Google Finance and Yahoo Finance serving as data providers.
The screen shot below shows the initial scripts submitted to either Google Finance (on the left) or Yahoo Finance (on the right). Although the start and end specifications are identical (Jan 1, 2017 through Aug 31, 2017), the starting dates for the provided data is different in both cases. Notice the Yahoo Finance data does not start until July 1, 2017, but the data from Google Finance starts about a year before the screen shot was taken.
The next screen shot shows historical stock prices for dates appearing both in the output displays from Google Finance and Yahoo Finance. The overlapping dates are for August 4, 2017 through August 31, 2017. Notice the historical stock prices are the same to within rounding. Google Finance presents just two places after the decimal, but Yahoo Finance displays to results to six places after the decimal.
While the two preceding screen shots show several other minor points of divergence between Google Finance and Yahoo Finance, the most important substantive point is that the historical stock prices are the same for the two data providers.
Another important issue is that the formats for displaying data are different so that the exact same code for extracting historical stock price data from both Google Finance and Yahoo Finance will not work without at least minor tweaks for each source. Based on the resolution time for a prior issue like this one, it is likely the pandas datareader issue with the Google Finance will be resolved relatively quickly (maybe even before you read this tip). In any event, relatively minor tweaks that account for formatting differences in the way Yahoo Finance provides historical stock prices compared to Google Finance will allow this tip to work with output from Yahoo Finance.
Using Python to pull historical stock prices for a list of ticker symbols
The following script shows an extension of the basic script shown above for pulling historical stock prices from Google Finance. The improvements in the script below include these features.
- The display for showing the content of a dataframe is dynamically sized to be as large as required to show all output values without any missing rows. By default, a dataframe only displays a maximum of 60 rows of output - split evenly between the beginning and ending rows of a dataframe.
- Instead of just reporting historical prices for a single fixed ticker symbol, namely nvda, the script below pulls ticker symbols from a file and populates a Python list object (symbol).
- Finally, the Python script below iterates through the ticker symbols so that historical price data are output successively for each ticker symbol in the file used to populate the symbol Python list object.
Let's walk through the code below to clarify how you can use Python to extract historical stock prices for a batch of symbols from Google Finance. Before diving into the code, notice that some lines start with a hash sign (#). The hash symbol is used to denote a comment line in Python.
There are five blocks of code within the script. These perform the following functions.
- The first block sets the pandas environment to dynamically expand the number of display rows for a dataframe so that a print statement shows all rows in its output without any missing rows. The setting for this feature simplifies the code relative to code in an earlier tip because each ticker symbol can have all its output displayed in a single print statement for a single dataframe. The prior tip used a succession of four two-month blocks to display historical stock prices from January through August in 2017.
- The second code block references the datetime and datareader libraries. The datetime library is a built-in Python library, and the datareader library is an external library that requires a separate install on a computer before you can use it.
- The third block of code shows how to read a list of ticker symbols from
a file and copy them to a Python list object.
- The name symbol denotes the name of the Python list object.
- The name f denotes a file path and name with the set of ticker symbols for which to retrieve historical stock prices.
- The with…open construct opens the file denoted by the name f.
- The for loop within the with…open construct passes through each line of the file.
- The append method for the symbol object adds each line from the file to the list after using the strip function to remove new line characters after the ticker symbol on a line.
- The fourth block sets the start and end date parameters for the datareader function in the fifth block. As indicated in the Change report from prior tip section, this feature is not working as of the time I am writing this tip for the Google Finance site. The preceding section fully demonstrates the issue, and it confirms that you can successfully set the start and end date parameters for use with the Yahoo Finance site.
- The fifth code block includes a while loop for iterating over the ticker
symbols and extracting historical stock prices sequentially for each ticker
- The while loop increments an integer variable (i) value from zero through to one less than the count of ticker symbols in the symbol list object.
- The DataReader method for the web object from the pandas_datareader library extracts the historical stock prices for a ticker symbol and passes its output to the f dataframe.
- The insert method in the next line inserts the current value of the indexed symbol list object into the dataframe at the column after the close price. The insert statement assigns the name Symbol to the inserted dataframe column.
- Finally, the print statement outputs the dataframe values for the current pass through the while loop.
#settings to dynamically display all rows #that are in a dataframe import pandas as pd pd.set_option('display.max_rows', None) #settings for importing built-in datetime #and external pandas_datareader libraries import datetime import pandas_datareader.data as web #read symbols from file to python list symbol =  with open('C:\python_programs\long_stock_symbol_list.txt') as f: for line in f: symbol.append(line.strip()) f.close #setting for start and end dates #notice that the output corresponds to #a one-year look-back window instead of #a lookback window for the start and #end parameter values start = datetime.datetime(2016, 9, 1) end = datetime.datetime(2017, 9, 30) #iterate over ticker symbol values #while retrieving historical and #displaying historical stock prices i=0 while i<len(symbol): f = web.DataReader(symbol[i], 'google', start, end) f.insert(4,'Symbol',symbol[i]) print (f) i=i+1
Another way to understand how this code works is by examining the input and output from the Python script. The input consists of a .txt file with 150 ticker symbols. The following two screen shots from NotePad++ show views of the first and last ten lines; the file name is long_stock_symbol_list.txt. The NotePad++ text editor can show non-printing characters on a .txt file line as well as indicate the line number of each line in a .txt file.
All lines in the long_stock_symbol_list.txt file, except for the last line, end with a carriage return and a linefeed (CR and LF) to move to the next line. The gray area with line numbers to the left of the ticker symbols does not appear in the output; it is merely for information purposes. For example, the first line in the first screen shot is for the first line in the file. Also, the last line in the second screen shot is for the last of 150 ticker symbols.
It is useful to check if the ticker symbol values you submit are used by Google Finance (or Yahoo Finance if you are using it instead as a provider of historical prices). It sometimes happens that different providers of financial data designate the same company with different ticker symbols. One advantage of the process relying on a file of ticker symbols is that you can get the list right once and then re-use the file whenever necessary. You can also create different sets of ticker symbols that you use for different financial data providers (besides even just Google Finance and Yahoo Finance).
The next pair of screen shots show the first and last ten lines of historical stock prices output by the preceding script; the script has the name read_long_symbol_list_and_download_stock_prices_w_max_rows.py. The script was run on September 18, 2017 after Google Finance posted historical stock prices for the day. The first date for the first stock ticker symbol is about one year before September 18, 2017; the first date (September 20, 2016) shows as the first data row in the first following screen shot. The last line of output within the second screen shot is for September 18, 2017.
You can also notice from the two screen shots below that the output occurs in ticker symbol order from the long_stock_symbol_list.txt file. Notice that the first ten rows are for the JOBS ticker symbol, which is the first symbol in the long_stock_symbol_list.txt file. Also, the last ten rows of output for historical prices is for the YY ticker symbol, which is the last ticker symbol in the long_stock_symbol_list.txt file.
Running the Python script from SQL Server Management Studio
You can return the results from the read_long_symbol_list_and_download_stock_prices_w_max_rows.py Python script file by invoking the xp_cmdshell extended stored procedure for a Windows batch file that uses the Python interpreter .exe file and the script file. The Windows batch file appears in the screen shot below. A prior tip gives all the details about how to use the xp_cmdshell extended stored procedure to securely run the batch file from SSMS and receive the output from the Python script in SSMS.
The top portion of the read_long_symbol_list_and_download_stock_prices_w_max_rows_part1.sql file shows the code used for an initial run of Python script in the preceding section. Recall that the Python script pulls historical prices for trading days during the past year for 150 ticker symbols. It takes about two minutes on my development computer to run the Python script and retrieve the output into SSMS. Once the data are in a SQL Server table as a column of string values from successive invocations of the Python print command for each ticker symbol, it takes just seconds to parse the column of string values into date, money, and varchar columns in another SQL Server table.
The key steps in the script are as follows.
- Enable the xp_cmdshell for use in a SSMS session.
- Create a temporary table (#Result) for storing the output from the Python script file. Also, another temporary table that will discussed later is dropped if it already exists.
- Run an exec statement to invoke the xp_cmdshell command to invoke the read_long_symbol_list_and_download_stock_prices_w_max_rows.bat file; a screen shot with the contents of this file appears in the preceding screen shot. This exec statement returns the output from the Python script to the #Result table.
- Finally, the sql code disables the xp_cmdshell extended stored procedure. This is because of security reasons and because the Python script output is already saved in the #Result table.
-- for use by Python dev team members -- run Python script for return historical stock prices -- a list of ticker symbols from a file -- enable the xp_cmdshell stored procedure EXEC sp_configure 'xp_cmdshell', 1 RECONFIGURE -------------------------------------------------------------------------------------- -- create a fresh temp table (#Result) for storing unprocessed result set from -- running xp_cmdshell for Windows batch that invokes -- python script if exists ( select * from tempdb.dbo.sysobjects o where o.xtype in ('U') and o.id = object_id(N'tempdb..#Result') ) BEGIN DROP TABLE #Result; END CREATE TABLE #Result ( line varchar(500) ) -- conditionally drop a temp table (#Resultprocessed) for storing processed -- result set from running xp_cmdshell for Windows batch that invokes -- python script if exists ( select * from tempdb.dbo.sysobjects o where o.xtype in ('U') and o.id = object_id(N'tempdb..#Resultprocessed') ) BEGIN DROP TABLE #Resultprocessed; END -------------------------------------------------------------------------------------- -- insert python script output into #Result temp table INSERT #Result exec xp_cmdshell 'C:\python_programs\read_long_symbol_list_and_download_stock_prices_w_max_rows.bat' -------------------------------------------------------------------------------------- -- disable the xp_cmdshell stored procedure EXEC sp_configure 'xp_cmdshell', 0 RECONFIGURE GO
The following two screen shots show the first ten lines of output returned to the #Results table followed by the last ten lines from the table.
Selected output from the first screen shot are summarized below.
- Lines 2 and 4 provide feedback about what files are used to generate output and where they are invoked.
- Lines 5 and 6 provide header information about the output for the first stock ticker symbol (JOBS).
- Lines 7 through 10 contain historical open, high, low, and close prices along with the stock ticker symbol for the prices (and volume, which is not used in this tip).
The output from the second screen shot below includes the last ten lines of output returned to #Result by the exec command in the preceding T-SQL script.
- The first nine of these lines are the last historical prices for the YY ticker symbol as of when the T-SQL script was run.
- The NULL on line 37349 is the final line returned to the #Result table by the exec command in the preceding script.
The following queries show the bottom portion of the read_long_symbol_list_and_download_stock_prices_w_max_rows_part1.sql file. These queries are presented to help you understand how to validate other runs of the T-SQL that you may invoke with different ticker symbols.
- There are 37349 rows of output in the #Result table. However, as you have seen already, not all of these rows contain valid historical prices.
- In order to have valid historical prices, a row must start with a date in the first ten positions of the line. The second query shows how to filter for rows like these (with the help of an implicit string-to-date conversion). Of the 37349 rows of output, 37044 rows are eligible to have valid historical prices.
- However, Google Finance may provide valid historical prices for some but not all types of prices on a specific date. When Google Finance does not have a valid historical price for a date, Python represents the historical price with a string value of NaN. This representation does not convert to a money value. There are 72 lines with one or more NaN values on them.
- By applying both the date and NaN filters, you can derive a result set that contains valid values for each type of price from open through close. There are 36972 rows like this in the #Result table.
-- Queries to validate outcome -- 37349 lines in total select * from #Result -- 37044 lines with that may have historical prices select * from #Result where isdate(substring(line,1,10)) = 1 -- 305 lines with no historical prices -- 4 top lines + 1 bottom line -- 150 lines with date or price headers -- for each of 150 ticker symbols select * from #Result where isdate(substring(line,1,10)) = 0 -- 72 lines with one or more invalid historical select * from #Result where isdate(substring(line,1,10)) = 1 and line like '%NaN%' -- 36972 rows with dates -- and no NaN historical prices select * from #Result where isdate(substring(line,1,10)) = 1 and line not like '%NaN%'
Guidelines and T-SQL code samples for parsing values from strings returned by Python
In my experience, the most fool-proof way of extracting values from Python output lines copied to a table like #Result is to crawl along a line following known rules about how values are positioned on the line.
- For example, the date column values in our output is always in the first 10 positions of a row if there is a date on that line. The previous section illustrates how to restrict your processing to just rows that start with a date.
- After extracting the date for a line, you can restrict your focus from the
eleventh character position to the end of the line. One useful trick is to use
a substring function to get all characters on a line after the date when looking
for the first historical price - namely, the open price.
- Initially, find the position of the first numeric character after date to find the position of the first character in the open price field.
- Next, using the position of the first character as an anchor, search for the first blank character after the first character in the open field.
- The open field value resides in its first numeric character to just before the first blank character after the first numeric character.
- Once you have extracted the open field value for a row, you can use the
first blank character after the open price field as an anchor to find the first
numeric character of the high price field value.
- Then, use the first numeric high price field value as a reference point to discover the position of the first blank field character after the high price field value.
- You can extract the high price field value as the monetary value of the characters between the first numeric character after the open price and the position just before first blank value after the start of the high price field value.
- The general guideline is to continue crawling the line until you have extracted all values needed for your requirement.
The following code excerpt from the read_long_symbol_list_and_download_stock_prices_w_max_rows_part2.sql file shows how to extract the date field value from a line. Each row in the query's result set has two values - one for line and another for date.
- The line value is the initial value in the #Result row.
- The date value is based on a substring expression that is nested in a cast
- The substring expression extracts the characters from the first ten positions on the line.
- The cast function converts the data type of the extracted values from a string to money.
- The two filters in the where clause restrict the selection of rows for processing to just those that are likely to have valid date and historical price values for transformation and saving in a SQL Server table.
-- get date from line select line ,cast((substring(line,1,10)) as date) [date] from #Result where isdate(substring(#Result.line,1,10)) = 1 -- to avoid lines with no historical price values and #Result.line not like '%NaN%' -- to avoid trying to parse a non-numeric string to a number
Here's a sample of the first eight rows of the result set from the preceding query. You can view the original row value in the first column from the result set. The extracted value appears in the second column from the result set.
The next three code blocks provide a model for extracting a field value like the one for open price from a string returned by the Python script.
- The first of the three following code blocks extracts three new fields useful
for extracting the open price in the line value for a row. These new fields
have the names
- The second code block is commented so that it does not execute. The code in the second code block is pseudo code that it is relatively easy to read; the code explicitly references field expressions developed in the preceding code block (and optionally other preceding code segments). The pseudo code is meant to convey an overview of how to extract the open price value from a line value without the distractions of detailed T-SQL code.
- The third code block is a copy of the second code block with detailed T-SQL code from the first code block (and optionally other preceding code segments) replacing the field names referenced in the pseudo code.
Extracting open price values is relatively simple because it is the first field extracted after the date field, and the date field is always in a known location - namely, the first ten characters of a line value. For the ticker symbols and dates used in this tip, the open price can have one of three different lengths:
- #.## denotes a length of 4 positions
- ##.## denotes a length of 5 positions
- ###.## denotes a length of 6 positions
No matter which length an open price value is, the field's value always starts with a number. The position of this first number for the open price value is derived by the start_of_open_in_line expression from the first code block. Every open price value ends with one or more spaces. Therefore, the code in the second and third blocks use the first blank space after the start_of_open_in_line expression value in a line to mark the end of the open price value; the expression for the first blank space has the name space_after_open_in_line. A substring function plucks the open price value from the line value based on the start_of_open_in_line and space_after_open_in_line expressions defined in the first code block.
-- get start of open in line -- and space after open in line select line ,cast(substring(line,1,10) as date) [date] ,substring(line,11,500) substring_after_date ,10+patindex('%[0-9]%',substring(line,11,500)) start_of_open_in_line ,charindex(' ',line,10+patindex('%[0-9]%',substring(line,11,500))) space_after_open_in_line from #Result where isdate(substring(#Result.line,1,10)) = 1 -- to avoid lines with no historical price values and #Result.line not like '%NaN%' -- to avoid trying to parse a non-numeric string to a number /* -- pseudo code for open after date select line ,cast((substring_for_date) as money) [date] ,cast ( substring ( line ,start_of_open_in_line , (space_after_open_in_line) - (start_of_open_in_line) ) ) as money [open] from #Result where isdate(substring(#Result.line,1,10)) = 1 -- to avoid lines with no historical price values and #Result.line not like '%NaN%' -- to avoid trying to parse a non-numeric string to a number */ -- code for open after date select line ,cast(substring(line,1,10) as date) [date] ,cast ( substring ( line ,10+patindex('%[0-9]%',substring(line,11,500)) ,(charindex(' ',line,10+patindex('%[0-9]%',substring(line,11,500)))) -(10+patindex('%[0-9]%',substring(line,11,500))) ) as money) [open] from #Result where isdate(substring(#Result.line,1,10)) = 1 -- to avoid lines with no historical price values and #Result.line not like '%NaN%' -- to avoid trying to parse a non-numeric string to a number
The next screen shot shows an excerpt from the result sets from the first and third code blocks described above. Notice that the second result set contains just the line value, the date value, and the open value. The open value is the first historical price in a line value. You can confirm the successful operation of the extraction code by verifying that the open field value matches the first number after date within each line value.
For the eleven rows in the screen shot below the open price is always in a fixed position. The open price values in the examples shown all have a ##.## format. However, across all 36972 rows with valid values for all historical price fields, other rows have values with #.## and ###.## formats. The preceding code is flexible enough to dynamically adjust to these variations when extracting open price values.
The following two code blocks present the first and second code blocks for extracting high price values from line values. Recall that for open price extraction the first block shows expressions that can be inserted from the first block into a copy of the second block to extract open price values in a third block of code. However, this approach did not work successfully when attempting to extract high prices. The good news is that a work-around was discovered.
Why did the approach for open price extraction not work for extracting high price column values? The simple answer as to why the prior approach was not successful is that SQL Server did not successfully compile the expressions in the initial try at a third code block. This is because the expressions for start_of_high_in_line and space_after_high_in_line were more complex than those for start_of_open_in_line and space_after_open_in_line.
As the code crawls farther along the line value, the expressions for the first character in a price field and the first blank after a price field grow more complex. This is because these expressions need to crawl farther along the line, and the number of characters to crawl is conditional on the position of prior prices on the line. Recall that you cannot necessarily discover the characters for a price field in a fixed set of line positions. Sometimes a price value can be in one set of positions for one line and another set of positions in the next line.
-- code for -- get start of high in line -- and space after high in line select line ,substring ( line ,charindex(' ',line,10+patindex('%[0-9]%',substring(line,11,500))) ,500) substring_after_open ,charindex ( ' ' ,line ,10+patindex('%[0-9]%',substring(line,11,500)) ) - 1 + patindex ( '%[0-9]%' ,substring( line ,charindex(' ',line,10+patindex('%[0-9]%',substring(line,11,500))) ,500) ) start_of_high_in_line ,charindex ( ' ' ,line , charindex ( ' ' ,line ,10+patindex('%[0-9]%',substring(line,11,500)) ) - 1 + patindex ( '%[0-9]%' ,substring ( line ,charindex(' ',line,10+patindex('%[0-9]%',substring(line,11,500))) ,500 ) ) ) space_after_high_in_line from #Result where isdate(substring(#Result.line,1,10)) = 1 -- to avoid lines with no historical price values and #Result.line not like '%NaN%' -- to avoid trying to parse a non-numeric string to a number /* -- pseudo code for high after open after date select line ,cast((substring_for_date) as money) [date] ,cast ( substring ( line ,start_of_open_in_line , ( (space_after_open_in_line) - (start_of_open_in_line) ) ) as money ) [open] ,cast ( substring ( line ,start_of_high_in_line , ( (space_after_high_in_line) - (start_of_high_in_line) ) ) as money ) [high] from #Result where isdate(substring(#Result.line,1,10)) = 1 -- to avoid lines with no historical price values and #Result.line not like '%NaN%' -- to avoid trying to parse a non-numeric string to a number */
The following screen shot presents eleven rows for the BBD ticker symbol with rows 4712 and 4713 highlighted. For row 4712, the first character for the high price is in position 21. However, the first character for row 4713 is 20. You can see that the shift in starting high price position between the two rows is because the high price went up between 4712 and 4713. In row 4712, the high price had a #.## format. When the price went up on the next trading day, the high price format changed to a ##.## format. For both rows, the first blank position in the line value after the first high price character was 25. This identical first blank position with two different starting positions confirms the price values are right justified.
A work-around for the SQL Server interpreter not being able to interpret the complex start_of_high_in_line and space_after_high_in_line expressions when copied into the pseudo code of the second block from the first block code is to use values instead of expressions. For example, you can save in a temp table (#Result_start_of_high_in_line_and_space_after_high_in_line) the values for the start_of_high_in_line and space_after_high_in_line expressions along with the source line field value. Then, you can inner join the temp table holding expression values with the #Result table by line value so that the start_of_high_in_line and space_after_high_in_line values instead of their expressions can be used when extracting the high price from a line.
After the join, the rows do not maintain the same order as before the join. Therefore, a where clause filter is added at the end of the query to extract just the line values that were formerly on rows 4712 and 4713 in the Result table. This filter can be removed if you are not trying to extract a specific two rows for comparison purposes. The following screen shot shows the same two rows as those highlighted in the preceding screen shot. Notice that the high price values are successfully extracted although they start in different positions within their line field values.
-- code for -- building and using temp table with -- start of high in line -- space after high in line -- drop prior version of temp table if it exists if exists ( select * from tempdb.dbo.sysobjects o where o.xtype in ('U') and o.id = object_id(N'tempdb..#Result_start_of_high_in_line_and_space_after_high_in_line') ) BEGIN DROP TABLE #Result_start_of_high_in_line_and_space_after_high_in_line; END -- populate temp table with result set from query select line ,charindex ( ' ' ,line ,10+patindex('%[0-9]%',substring(line,11,500)) ) - 1 + patindex ( '%[0-9]%' ,substring( line ,charindex(' ',line,10+patindex('%[0-9]%',substring(line,11,500))) ,500) ) start_of_high_in_line ,charindex ( ' ' ,line , charindex ( ' ' ,line ,10+patindex('%[0-9]%',substring(line,11,500)) ) - 1 + patindex ( '%[0-9]%' ,substring ( line ,charindex(' ',line,10+patindex('%[0-9]%',substring(line,11,500))) ,500 ) ) ) space_after_high_in_line into #Result_start_of_high_in_line_and_space_after_high_in_line from #Result where isdate(substring(#Result.line,1,10)) = 1 -- to avoid lines with no historical price values and #Result.line not like '%NaN%' -- to avoid trying to parse a non-numeric string to a number -- code for high after open after date select #Result.line ,cast(substring(#Result.line,1,10) as date) [date] ,cast ( substring ( #Result.line ,10+patindex('%[0-9]%',substring(#Result.line,11,500)) ,(charindex(' ',#Result.line,10+patindex('%[0-9]%',substring(#Result.line,11,500)))) -(10+patindex('%[0-9]%',substring(#Result.line,11,500))) ) as money ) [open] ,cast ( substring ( #Result.line ,#Result_start_of_high_in_line_and_space_after_high_in_line.start_of_high_in_line , ( (#Result_start_of_high_in_line_and_space_after_high_in_line.space_after_high_in_line) - (#Result_start_of_high_in_line_and_space_after_high_in_line.start_of_high_in_line) ) ) as money ) [high] from #Result inner join #Result_start_of_high_in_line_and_space_after_high_in_line on #Result.line = #Result_start_of_high_in_line_and_space_after_high_in_line.line where isdate(substring(#Result.line,1,10)) = 1 -- to avoid lines with no historical price values and #Result.line not like '%NaN%' -- to avoid trying to parse a non-numeric string to a number -- remove this filer if you want all the rows instead -- of just these two rows for unit testing purposes and #Result.line in ( '2017-08-04 9.89 9.94 9.84 9.90 BBD 8195203' ,'2017-08-07 9.89 10.06 9.86 10.02 BBD 8677223' )
SQL script output for all historical prices and preliminary unit test results
One challenge for getting historical stock prices collected with a Python script into SQL Server is parsing the prices from strings returned by Python and the xp_cmdshell extended stored procedure. The preceding section presents guidelines and sample code on how to extract the prices from the strings. Basically, all your code needs to do is crawl along the positions of a string and extract prices (or anything else you want such as a ticker symbol) as you find them.
In this section, you learn how to confirm that extracted prices and symbols are valid. Also, this section presents results from a solution that overcomes all discovered issues. This section presents the sample output from the complete solution along with selected unit tests of the output; the complete solution is an implemented adaptation of the process described in the preceding section. Additionally, this section shows an example of how to use extracted close prices.
The following two screen shots show the first and last ten rows of prices for Apple (AAPL) with historical prices from 2016-09-23 through 2017-09-21. The results were generated by the read_long_symbol_list_and_download_stock_prices_w_max_rows_full.sql file. The line column values are the string values returned to SQL Server from the underlying Python script. The date through symbol columns are the extracted values from the line input. By scanning the line values versus the extracted column values, you can confirm if the extraction worked successfully. The complete solution includes one row set for AAPL and 149 other row sets for the collection of other symbols. Recall that there are 150 ticker symbols in the file of ticker symbols, which matches the number or row sets returned by the solution.
It may be worth repeating that the downloading of the raw input and extraction of column values for 150 row sets into SQL Server took just about two minutes!
The last step of the code writes the results for all row sets into a global temp table named ##Resultprocessed. This global temp table is derived from a local temp table named #Resultprocessed. The read_long_symbol_list_and_download_stock_prices_w_max_rows_full.sql file populates a local temp table named #Resultprocessed using a parsing algorithm that is substantively the same as the guideline described above to transform #Result table contents into contents like those displayed in the two preceding screen shots for historical prices belonging to all ticker symbols. Then, the local #Resultprocessed table is copied to a global temp table (##Resultprocessed) for use in other queries besides the one in the read_long_symbol_list_and_download_stock_prices_w_max_rows_full.sql file.
The following script from the read_long_symbol_list_and_download_stock_prices_w_max_rows_full_unit_test.sql file verifies that the ##Resultprocessed temp table contains results for 150 symbols. You can vary the ticker symbols for which historical prices are collected by adding, deleting, and updating the contents of the long_stock_symbol_list.txt file. If the number of distinct symbol values does not match the number of symbols in the long_stock_symbol_list.txt file, then there is a problem with the processing of your symbol list or the symbol values may not be valid for the data provider.
-- 150 distinct symbols are in the result set -- 150 ticker symbols are in a source table -- referenced by a Python script for extracting historical prices select distinct symbol from ##Resultprocessed order by symbol
One motivation for compiling historical stock prices is to see how prices are performing for different stock ticker symbols. The following script also from the read_long_symbol_list_and_download_stock_prices_w_max_rows_full_unit_test.sql file displays the percentage change for stock close prices from the first date through the last date for which data is available. The result set is ordered by descending percentage change. An excerpt from result set appears in the screen shot below the code. While past performance is no guarantee of future performance, some stock traders believe that the "trend is your friend". Those believing in this maxim may be inclined to purchase and/or hold stocks that are in the top 10 for percentage change.
-- Find percentage change in close price from earliest date to most recent date -- display the results for ticker symbols in the order of percentage gain select * ,round(((most_recent_stock_price.[close]/earliest_stock_price.[close]) - 1)*100,2) close_percent_change from ( -- close price on eariest date select ##Resultprocessed.symbol ,##Resultprocessed.[date] ,##Resultprocessed.[close] from ##Resultprocessed inner join ( select symbol, min([date]) earliest_date from ##Resultprocessed group by symbol ) symbol_earliest_date on ##Resultprocessed.symbol = symbol_earliest_date.symbol and ##Resultprocessed.date = symbol_earliest_date.earliest_date ) earliest_stock_price inner join ( -- close price on most recent date select ##Resultprocessed.symbol ,##Resultprocessed.[date] ,##Resultprocessed.[close] from ##Resultprocessed inner join ( select symbol, max([date]) most_recent_date from ##Resultprocessed group by symbol ) symbol_most_recent_date on ##Resultprocessed.symbol = symbol_most_recent_date.symbol and ##Resultprocessed.[date] = symbol_most_recent_date.most_recent_date ) most_recent_stock_price on earliest_stock_price.symbol = most_recent_stock_price.symbol order by round((cast(most_recent_stock_price.[close] as money)/cast(earliest_stock_price.[close] as money))*100 - 1,2) desc
- Install Python on your computer if it is not already installed. See this tip for detailed instructions on how to install Python in a Windows environment for use by all users.
- Next, install the pandas_datareader external reader library to your Python installation. See this tip for detailed instructions about installing the pandas_datareader external library for use by all Windows users. Also, set up the Windows standard account and create a SQL Server credential that points at the Windows standard account. The same tip as the one for installing the pandas_datareader library also discusses how to the configure the Windows standard account as well as how to set special permissions to deny selected permissions from the account, such as to deny the permission to delete files in a folder.
- Next, download the files for this tip into a c:\python_programs directory.
- Then, run the read_long_symbol_list_and_download_stock_prices_w_max_rows_full.sql file from the directory. It should generate a result set (with different dates) with columns like the first two screen shots displayed in the SQL script output for all historical prices and preliminary unit test results section.
- You can then run the final script from the read_long_symbol_list_and_download_stock_prices_w_max_rows_full_unit_test.sql file. Depending on how close you are to October 2017, it should generate a top ten list that is reasonably close to the one showing in the final screen shot of this article. Top ten lists are likely to change over time; the longer the time is from when a list was compiled, the greater the likelihood of change.
- Finally, change the long_stock_symbol_list.txt file by adding a stock ticker symbol that is not already there. After you run the read_long_symbol_list_and_download_stock_prices_w_max_rows_full.sql file, this should give you 151 symbols, instead of 150 symbols, in your result set when you next run the read_long_symbol_list_and_download_stock_prices_w_max_rows_full_unit_test.sql file.
About the author
View all my tips