SQL Server 2016 introduced support for executing R scripts inside the boundaries of T-SQL script execution in the database engine. With the advent of SQL Server 2017, this support has been enhanced and baked into SQL Server in the form of Machine Learning Server and Machine Learning Services (In-database). Machine Learning Server for Windows provides parallelized big data analysis, transformation, modeling and operationalization capabilities. In-database Machine Learning Services includes extensions that enable integration with R using standard T-SQL statements.
In order to understand the influence of R on professionals in terms of the SQL Server related roles and responsibilities, first we need to understand how R is integrated into SQL Server, its application and capabilities, and whether itís replacing T-SQL tasks or will it complement T-SQL capabilities. Understanding this will make it easier to assess the influence of R on our day to day activities. So letís try to understand how R works with SQL Server 2017 (CTP2).
Before we delve into the integration architecture, letís conceive a tangible perception of R for the purpose of discussion. Consider R as an open-source programming language that has all the programming constructs like variables, loops, operators, etc. like any other programming language. One key differentiator of R is that it consists of a vast number of libraries for statistical computing, visualizations, natural language processing, machine learning, etc.
Letís say that we want to compute certain statistics in our regular T-SQL code that requires very complex logic in terms of conceptualization as well as implementation. So we decide to use R for this purpose, as there are ready-to-use libraries that can compute these values without the need to learn or implement any complex statistical formulas. So here we have a use-case of using R inside T-SQL. A stored procedure named sp_execute_external_scripts is the provision inside T-SQL to make a request for executing external scripts.
When you run R code from T-SQL using Machine Learning Services, all R scripts are executed outside the SQL Server process, to provide security and greater manageability. SQL Server has introduced new components to enable integration with R, which are mentioned below.
The below architecture diagram from MSDN shows how these components work together to enable seamless integration of R with SQL Server Database Engine.
R in SQL Server is very new and yet to witness a large scale community adoption. Considering the existing unique and differentiating applications of R in the industry, the following application on R (in no particular order) in SQL Server seems probable.
R as well as T-SQL both are programming languages that can work with data. Both contain some common as well as unique programming constructs and capabilities. So the natural question that comes to mind is whether R can be used to replace some of the T-SQL data manipulation tasks? Will R replace some of the T-SQL features or R will complement T-SQL to extend its capabilities beyond data manipulation like machine learning, statistical modeling, etc.
R has libraries to deal with data in the way T-SQL manipulates data. R architecture natively was not built for multi-threaded, multi-processor, parallel processing. There are several parallel processing libraries for R available that allow you to explicitly run calculations in R simultaneously. Microsoft R includes a distribution of the Intel Math Kernel Library (MKL), which is used whenever possible for faster mathematical processing, compared to Rís standard Binary Linear Algebra (BLAS) library. But not all R commands make use of the BLAS, due to which those calculations wonít benefit from the parallel performance architecture. Microsoft is supposedly working on R to bake multi-processing architecture in more areas.
Among the most important additions to Microsoft R are the RevoScaleR and RevoPemaR packages. These are R packages that have been written largely in C or C++ for better performance. RevoScaleR includes a variety of APIs for data manipulation and analysis. The APIs have been optimized to analyze data sets that are too big to fit in memory and to perform computations distributed over several cores or processors. R solutions based on the RevoScaleR functions can work with very large data sets and are not bound by local memory. PEMA stands for Parallel External Memory Algorithm. The RevoPemaR package provides APIs that you can use to develop your own parallel algorithms.
So architecturally speaking it makes sense to use T-SQL for dealing with large volumes of data calculations, and use R in the area of its strength i.e. statistical modeling, machine learning, predictive analytics, etc.
Installing Machine learning services installs Microsoft R Open server, with its set of command line tools. Microsoft Visual Studio 2017 also contains a set of tools to create R solution with almost the same native R environment integrated into Visual Studio IDE.
Installing In-Database Machine Learning Services installs extensions which enables executing R scripts from T-SQL. Usually SSMS is used to develop and execute T-SQL scripts, so this can be used to execute R scripts from T-SQL.
Now that we understand how R fits in the SQL Server Ecosystem, letís understand how to install and configure Machine Learning Server and Services with R in the next lesson.
In case you are interested in exploring R independently before starting to use if in SQL Server, you can download R from here.