Import Data from PDF to SQL Server


By:   |   Updated: 2020-12-14   |   Comments   |   Related: More > R Language


Problem

In this article we cover how to import data from a PDF file into a SQL Server table with R. We will use an example of past lottery winning numbers to see how you could use R to load the data and possibly predict the next set of winning numbers.

Solution

Lottery winning numbers can be manually downloaded in a PDF file one at a time. In this article we will import the Mega Millions winning numbers from a PDF file into a SQL table. This is just a simple exercise that shows how powerful the combination of R and SQL in SQL Server can be.

Import PDF into SQL Server from Georgia Lottery

The first step is to download the winning numbers from the official Georgia lottery website.

From the website, we select the Mega Millions lottery game and click on download to get the numbers.

mega millions

Please notice that if the winning numbers are on multiple pages, we will have to download each page.

megaball

Once we have the winning numbers in PDF files, we can use R to extract information like Date, Winning Numbers, and Megaball, and import them into a SQL Server table for further analysis. The image below shows the contents of the lottery pdf file.

megaball

Read PDF File and Extract Information with R

Let see how we can read the pdf file and extract the information it contains. First, we need to install the R pdftools package in order to use the pdf_text R function. As usual, we will be using RStudio to execute our R scripts. The code below reads the PDF file and splits each line according to the "\n" character (Line Feed).

install.packages("pdftools")
library(pdftools)
pdf_text("GA_Lottery_WinningNumbers MegaMillions.PDF") %>% strsplit(split = "\n")
mega millions

At first glance, we see that the pdf_text function can correctly read the pdf file.

Examine Data Type Once Import into R Data Frame

Let us examine our data type once imported into an R data frame. First, we import our pdf file into a data frame (gal) and split it by "\n" (Line Feed Character). When we issue the str() R command, we see that the type of data returned is an R list data type.

gal <- pdf_text("GA_Lottery_WinningNumbers MegaMillions.PDF")
gal1 <- strsplit(gal, "\n")
str(gal1)
winning numbers

Check the Data Structure Result in R

In order to work with the data frame data, unlist() the gal1 data frame and check the final data structure result.

gal2 <- unlist(gal1)
str(gal2)
terminal

Print Data Frame in R

Manually extract the information from our data frame gal2.

print(gal2)
mega millions

We can access our data one line at a time:

print(gal2[3])
print(gal2[4])
r terminal

We can use R function substr() to access its information like date, Megaball number, and so on.

print(gal2[4])
# Get 4th element date
print(substr(gal2[4], 1,10))
# Get 4th element Megaball Number
print(substr(gal2[4], 40,42))
terminal

Finally, we can count the positions of all the data we need to insert into our SQL table.

print(gal2[4])
# Get 4th element date
print(substr(gal2[4], 1,10))
# Get Winning Numbers
print(substr(gal2[4], 12,13))
print(substr(gal2[4], 15,16))
print(substr(gal2[4], 18,19))
print(substr(gal2[4], 21,22))
print(substr(gal2[4], 24,25))
# Get 4th element Megaball Number
print(substr(gal2[4], 40,42))
console terminal

R Loop to Extract Position of Numbers

Now that we know the positions of our numbers, we can write a loop to extract them all in one execution.

seq <- c(3:21)
 
for( i in seq) {
  # Get 4th element date
  print(substr(gal2[i], 1,10))
  # Get Winning Numbers
  print(substr(gal2[i], 12,13))
  print(substr(gal2[i], 15,16))
  print(substr(gal2[i], 18,19))
  print(substr(gal2[i], 21,22))
  print(substr(gal2[i], 24,25))
  # Get 4th element Megaball Number
  print(substr(gal2[i], 40,42))
}
terminal

Insert Data into SQL Server with R

Finally, we can write the final R code to insert our data into SQL. Create the table as:

CREATE TABLE GALottery (
   LotteryDate date,
   Num1 int,
   Num2 int,
   Num3 int,
   Num4 int,
   Num5 int,
   MageMillion int
)
create table

We now need to create a connection to our SQL database and the code to loop through the pdf files values. We then insert them into a SQL table.

sqlcnt <- odbcDriverConnect('driver={SQL Server Native Client 11.0};uid=UsreName;pwd=UserPWD;server=SQLServer;database=YourDatabase;rows_at_time=1024')
 
seq <- c(3:21)
 
for( i in seq) {
  # Get 4th element date
  date <- substr(gal2[i], 1,10)
  # Get Winning Numbers
  Num1 <- substr(gal2[i], 12,13)
  Num2 <- substr(gal2[i], 15,16)
  Num3 <- substr(gal2[i], 18,19)
  Num4 <- substr(gal2[i], 21,22)
  Num5 <- substr(gal2[i], 24,25)
  # Get 4th element Megaball Number
  MgMil <- substr(gal2[i], 40,42)
  # Prepare our T-SQL statement
  rgQuery <- sprintf("insert dbo.GALottery values('%s','%s','%s','%s','%s','%s','%s')",date,Num1,Num2,Num3,Num4,Num5,MgMil)
  gmt <- sqlQuery(sqlcnt,rgQuery, stringsAsFactors = F,as.is = T)
  # Print our T-SQL - Verification only
  print(rgQuery)
}
sql database

We check our target SQL table to verify that our numbers have been entered correctly. Execute the query below in SQL Server Management Studio.

select * from dbo.GALottery
results
Conclusion

In this tip we have learned how to read data from a pdf file and insert the results into a SQL table. We used the following R functions:

  • pdf_text to read a pdf file into a data frame.
  • strsplit( x, "\n") to split the values inside our data frame by "\n" (Line Feed) character.
  • substr( x, start, end) to extract substring value; our numbers.
  • Unlist() to transform the list into a vector.
  • Sprintf() to create our T-SQL syntax. Please note that the sprintf syntax is the same as the C function sprintf
Next Steps


Last Updated: 2020-12-14


get scripts

next tip button



About the author
MSSQLTips author Matteo Lorini Matteo Lorini is a DBA and has been working in IT since 1993. He specializes in SQL Server and also has knowledge of MySQL.

View all my tips



Comments For This Article





download





Recommended Reading

Using Simple Linear Regression to Make Predictions

Data Structures in R including Vector, Matrix, Array, List, and Data Frame

sqldf in R Example for SQL Server

Index Numbers Explained with Examples in R

Exploring Four Simple Time Series Forecasting Methods with R Examples














get free sql tips
agree to terms