R in SQL Server Ecosystem
By: Siddharth Mehta
SQL Server 2016 introduced support for executing R scripts inside the boundaries of T-SQL script execution in the database engine. With the advent of SQL Server 2017, this support has been enhanced and baked into SQL Server in the form of Machine Learning Server and Machine Learning Services (In-database). Machine Learning Server for Windows provides parallelized big data analysis, transformation, modeling and operationalization capabilities. In-database Machine Learning Services includes extensions that enable integration with R using standard T-SQL statements.
In order to understand the influence of R on professionals in terms of the SQL Server related roles and responsibilities, first we need to understand how R is integrated into SQL Server, its application and capabilities, and whether itís replacing T-SQL tasks or will it complement T-SQL capabilities. Understanding this will make it easier to assess the influence of R on our day to day activities. So letís try to understand how R works with SQL Server 2017 (CTP2).
R and SQL Server Database Engine Integration Architecture
Before we delve into the integration architecture, letís conceive a tangible perception of R for the purpose of discussion. Consider R as an open-source programming language that has all the programming constructs like variables, loops, operators, etc. like any other programming language. One key differentiator of R is that it consists of a vast number of libraries for statistical computing, visualizations, natural language processing, machine learning, etc.
Letís say that we want to compute certain statistics in our regular T-SQL code that requires very complex logic in terms of conceptualization as well as implementation. So we decide to use R for this purpose, as there are ready-to-use libraries that can compute these values without the need to learn or implement any complex statistical formulas. So here we have a use-case of using R inside T-SQL. A stored procedure named sp_execute_external_scripts is the provision inside T-SQL to make a request for executing external scripts.
When you run R code from T-SQL using Machine Learning Services, all R scripts are executed outside the SQL Server process, to provide security and greater manageability. SQL Server has introduced new components to enable integration with R, which are mentioned below.
- Launchpad - The SQL Server Trusted Launchpad is a service provided by SQL Server 2017 for supporting execution of external scripts, similar to the way that the full-text indexing and query service launches a separate host for processing full-text queries.
- Binary Exchange Language (Bxl) Server - BxlServer is an executable provided by Microsoft that manages communication between SQL Server and the R runtime.
- SQL Satellite - The SQL Satellite is a new extensibility API in SQL Server that is provided by the database engine to support external code. BxlServer uses SQL Satellite for communicating with SQL Server.
The below architecture diagram from MSDN shows how these components work together to enable seamless integration of R with SQL Server Database Engine.
- SQL Server sends request to the Launchpad service.
- The Launchpad service starts the RLauncher.
- RLauncher starts the external R process.
- BxlServer coordinates with the R runtime to manage exchanges of data with SQL Server and storage of working results.
- SQL Satellite manages communications about related tasks and processes with SQL Server.
- BxlServer uses SQL Satellite to communicate status and results to SQL Server.
- SQL Server gets results and closes related tasks and processes.
Applications of R in SQL Server
R in SQL Server is very new and yet to witness a large scale community adoption. Considering the existing unique and differentiating applications of R in the industry, the following application on R (in no particular order) in SQL Server seems probable.
- Machine Learning
- Natural Language Processing
- Financial Data Modeling
- Bio-informatics data analysis
- Network data analysis
- Spatial and geo-statistical modeling
- Web Scraping and data ingestion
- Statistical Graphics
R with T-SQL compared to R versus T-SQL
R as well as T-SQL both are programming languages that can work with data. Both contain some common as well as unique programming constructs and capabilities. So the natural question that comes to mind is whether R can be used to replace some of the T-SQL data manipulation tasks? Will R replace some of the T-SQL features or R will complement T-SQL to extend its capabilities beyond data manipulation like machine learning, statistical modeling, etc.
R has libraries to deal with data in the way T-SQL manipulates data. R architecture natively was not built for multi-threaded, multi-processor, parallel processing. There are several parallel processing libraries for R available that allow you to explicitly run calculations in R simultaneously. Microsoft R includes a distribution of the Intel Math Kernel Library (MKL), which is used whenever possible for faster mathematical processing, compared to Rís standard Binary Linear Algebra (BLAS) library. But not all R commands make use of the BLAS, due to which those calculations wonít benefit from the parallel performance architecture. Microsoft is supposedly working on R to bake multi-processing architecture in more areas.
Among the most important additions to Microsoft R are the RevoScaleR and RevoPemaR packages. These are R packages that have been written largely in C or C++ for better performance. RevoScaleR includes a variety of APIs for data manipulation and analysis. The APIs have been optimized to analyze data sets that are too big to fit in memory and to perform computations distributed over several cores or processors. R solutions based on the RevoScaleR functions can work with very large data sets and are not bound by local memory. PEMA stands for Parallel External Memory Algorithm. The RevoPemaR package provides APIs that you can use to develop your own parallel algorithms.
So architecturally speaking it makes sense to use T-SQL for dealing with large volumes of data calculations, and use R in the area of its strength i.e. statistical modeling, machine learning, predictive analytics, etc.
What R in SQL Server means for Database developers, DBAs, Data analysts and Data scientists?
- Database developers should learn R, so that they can extend their competency beyond basic tabular data manipulation as business requirements of predictive analytics would mandate the use of R in T-SQL. R is also supported on Azure Machine Learning. Considering the penetration of analytics and cloud in the industry, it becomes very important for database developers to have the knowledge of data science languages that runs the analytics in the cloud.
- DBAs do not need to learn R in particular, but would need to learn how to monitor and govern R resource usage. Database administrator can control who has the ability to run R jobs, and who has the ability to install or share R packages. The administrator can also monitor the use of R scripts by either remote or local users, and monitor and manage the resources consumed.
- Data analysts should inevitably learn R as it has all the means to bring predictive analytics, machine learning, and related analytics on raw data. Without these capabilities, the capabilities of T-SQL would be limited to data profiling and exploration from an analytics perspective.
- Data scientists of major successful and large-scale organizations generally use R and/or Python as the primary data science language. So if you are an aspirant data scientist, there is no option but to learn R and apply it for different data science algorithms and techniques.
R tools in SQL Server Ecosystem
Installing Machine learning services installs Microsoft R Open server, with its set of command line tools. Microsoft Visual Studio 2017 also contains a set of tools to create R solution with almost the same native R environment integrated into Visual Studio IDE.
Installing In-Database Machine Learning Services installs extensions which enables executing R scripts from T-SQL. Usually SSMS is used to develop and execute T-SQL scripts, so this can be used to execute R scripts from T-SQL.
Now that we understand how R fits in the SQL Server Ecosystem, letís understand how to install and configure Machine Learning Server and Services with R in the next lesson.
In case you are interested in exploring R independently before starting to use if in SQL Server, you can download R from here.