By: Matteo Lorini | Comments | Related: More > R Language
Problem
I have a basic knowledge of R, is it possible to manipulate data inside an R dataframe using regular T-SQL syntax?
Solution
R has a package called sqldf that allows developers to manipulate data inside a dataframe in the same way a SQL developer, queries a SQL table. In this tutorial I will show how to install the package and how to use it to query some values from the sample AdventureWorks2014 database.
Step 1: First, we need to install the sqldf package. This is simply done by the following command:
install.packages("sqldf")
Step 2: Use the AdventureWorks2014 database, Sales.SalesOrderHeader and Sales.SalesTerritory tables to create our sample query.
SELECT T.[Name], convert(varchar(20),[OrderDate],110) as OrderDate, [SalesOrderID],[CustomerID], S.[TerritoryID],[SubTotal],[TaxAmt],[TotalDue] FROM Sales.SalesOrderHeader AS S INNER JOIN [Sales].[SalesTerritory] AS T ON T.TerritoryID = S.TerritoryID
Step 3: Create an R script in RStudio that connects to SQL, reads the data returned from the above query into an R dataframe structure.
library(RODBC) library(dplyr) sqlcnt <- odbcDriverConnect('driver={SQL Server Native Client 11.0};uid=xxxxxxxxx;pwd=xxxxxxx;server=xxxxxxxxx;database=AdventureWorks2014;rows_at_time=1024') sqlQry <- "SELECT T.[Name], convert(varchar(20),[OrderDate],110) as OrderDate, [SalesOrderID],[CustomerID], S.[TerritoryID],[SubTotal],[TaxAmt],[TotalDue] FROM Sales.SalesOrderHeader AS S INNER JOIN [Sales].[SalesTerritory] AS T ON T.TerritoryID =S.TerritoryID" # Store data into our adwdf dataframe adwdf <- sqlQuery(sqlcnt, sqlQry, stringsAsFactors = F,as.is = T) # To view our data View(adwdf)
Step 4: Finally, we have our data loaded into a dataframe, we can show how easy it is to work with dataframe in a SQL like manner. Let us start with a simple query that lists all the Canadian orders:
#adwdf is our DataFrame library(sqldf) sqldf("select * from adwdf where Name = 'Canada'")
Using T-SQL code.
SELECT T.[Name], convert(varchar(20),[OrderDate],110) as OrderDate, [SalesOrderID],[CustomerID], S.[TerritoryID],[SubTotal],[TaxAmt],[TotalDue] FROM Sales.SalesOrderHeader AS S INNER JOIN [Sales].[SalesTerritory] AS T ON T.TerritoryID =S.TerritoryID WHERE [Name] = 'Canada'
Next, let us aggregate some data using a group by clause
sqldf("select name, sum(SubTotal) from adwdf group by name")
Using T-SQL code.
SELECT T.[Name], sum([SubTotal]) FROM Sales.SalesOrderHeader AS S INNER JOIN [Sales].[SalesTerritory] AS T ON T.TerritoryID =S.TerritoryID GROUP BY [Name]
Lastly, let us query our dataframe to look for Total Due greater than one million.
sqldf("select name, OrderDate , sum(SubTotal) SubTotal,sum(taxAmt) taxAmt,sum(TotalDue) TotalDue from adwdf group by name, OrderDate having sum(TotalDue) >1000000")
Using T-SQL code.
SELECT T.[Name],OrderDate, sum(SubTotal) SubTotal,sum(taxAmt) taxAmt,sum(TotalDue) TotalDue FROM Sales.SalesOrderHeader AS S INNER JOIN [Sales].[SalesTerritory] AS T ON T.TerritoryID =S.TerritoryID GROUP BY [Name],OrderDate HAVING sum(TotalDue) >1000000
Conclusion
The R sqldf package is versatile and we can use it to create ways to filter data in a dataframe using familiar T-SQL syntax.
Next Steps
- The reader will need to install RStudio in order to test this tip.
- R Tutorial
- Check out these tips
About the author
This author pledges the content of this article is based on professional experience and not AI generated.
View all my tips