By: Siddharth Mehta
Overview
R is an open source programming language and environment. It is most widely-used for statistical computing, statistical graphics and data science, and is one of the most popular data science languages. R has gained a wide acceptance in the statistical computing, predictive analytics and machine learning professional community. R consists of huge number of libraries for machine learning, natural language processing, domain specific data manipulation for bioinformatics finance, statistical graphics, parallel computing and more.
Microsoft acquired R last year with a vision of enabling advanced analytics within Microsoft data platforms on-premises, hybrid cloud environments and on Microsoft Azure. Post-acquisition, Microsoft has integrated R with SQL Server, PowerBI, Azure and Cortana Analytics. Also, Revolution R Open has been renamed to Microsoft R Open and Revolution R Enterprise to SQL Server R Services (SQL Server Machine Learning Services in SQL Server 2017) and/or Microsoft R Server.
SQL Server R Services / SQL Server Machine Learning Services install an open source distribution of R, as well as packages provided by Microsoft that support distributed and/or parallel processing. The architecture is designed such that external scripts using R run in a separate process from SQL Server. R Services integrates the R language with SQL Server, which helps in performing analytics close to the data and eliminate the costs and security risks associated with data movement.
Traditional data analytics methodology relies on transporting and transforming data from OLTP databases > Data Warehouses > Data Marts using PowerShell for administration, SSIS for ETL, SSAS for multi-dimensional / in-memory analytics, and SSRS for reporting. Data manipulation using set based operations and mathematical algebra has been the best possible solution with T-SQL on data stored in OLTP databases. Using R with T-SQL extends the power of data science, statistical computing, machine learning and other advanced predictive analytics capabilities to OLTP systems.
This tutorial is aimed at helping experienced T-SQL Developers, DBAs, Data Analysts and Data Science enthusiasts to start using R language inside and with T-SQL. This enables to perform data science and analytics closest to the actual data, without the need to follow traditional BI methodology of transporting and transforming data across repositories. The tutorial is structured in five lessons, with each lesson focused on explaining corresponding key points as mentioned below.
Lesson 1: R in SQL Server Ecosystem
- What R in SQL Server means for developers, DBAs, data analysts and data scientists
- R with T-SQL compared to R versus T-SQL
- Applications of R in SQL Server
- R tools in SQL Server Ecosystem
- R and SQL Server Database Engine Integration Architecture
Lesson 2: Installing Machine Learning Services
- Install SQL Server 2017 CTP2 with Machine Learning Services
- Install VS 2017 with R tools
- Explore R installation and tools
Lesson 3: Basic R Concepts
- R version, Libraries, Datasets
- Variables, Comments, Printing Data
- Arithmetic, Operators, Loops
Lesson 4: R with T-SQL
- Using sp_execute_external_scripts
- Reading data in R from T-SQL
- Returning manipulated data from R to T-SQL
Lesson 5: Graphical Analysis with R
- Creating a graphics device in R
- Generating graphs / plots using data read from tables / views
- Using R script as a stored procedure
- Useful Resources
By the end of this tutorial, you should be able to develop basic R scripts that read data from OLTP databases and apply graphical analysis.
So, let’s get started with the first lesson to understand the influence of R in SQL Server and its impact on the SQL Server community.