In a previous tip on SQL Server 2016 Features in CTP2, one of the new features introduced was PolyBase. Our organization is thinking of deploying Big Data solutions running on the Apache Hadoop platform and I wanted to know how I can use PolyBase to perform data analysis.
In part 1 of this series, we looked at exploring PolyBase in SQL Server 2016 to get started on working with Big Data thru SQL Server. We've also looked at the PolyBase architecture and installed PolyBase in a single-server architecture. In this tip, we will install a multi-server configuration with one server acting as the head node and several others running as a compute node.
When working with a distributed, parallel architecture like Hadoop, a standalone, single-server deployment of PolyBase can become a performance bottleneck. That's because only one server is handling all the tasks of the control node and the compute node. The single-server deployment is only useful for proof-of-concept (POC), learning and testing PolyBase functionalities with Hadoop. For real production workloads, you will need to implement a PolyBase scale-out group.
PolyBase Scale-Out Groups
Similar to how you can scale-out Hadoop by adding multiple compute nodes in the cluster, you can do the same with PolyBase using the concept of a scale-out group. A PolyBase scale-out group allows you to create a cluster of SQL Server 2016 instances to leverage Hadoop's parallel processing architecture and improve query performance. You can start with a handful of SQL Server 2016 instances and, if the workload requires it, bring additional SQL Server 2016 instances into the cluster. This allows you to scale out PolyBase as the compute requires.
A diagram of the mapping between a PolyBase scale-out group and a multi-node Hadoop cluster is shown below.
Installing and Configuring a PolyBase Scale-Out Group
While the installation is similar to the one outlined in part 1 of this series, we have several additional prerequisites:
- SQL Server 2016 instances in the same Active Directory domain
- Active Directory domain user account to run the PolyBase services
- SQL Server 2016 Enterprise Edition for the head node
I'm assuming that your SQL Server 2016 instances are all on-premises, not Azure SQL Datawarehouse.
Assuming that you've installed the Oracle Java SE Runtime Environment (JRE) 7.51 (x64) or higher prior to running the SQL Server 2016 installation media, proceed to install the PolyBase Query Service for External Data.
In the PolyBase Configuration dialog box- unlike in the previous tip - select the option Use this SQL Server as a part of PolyBase scale-out group.
Take note of the port range in the Specify a port range for PolyBase services: text field. Your network engineers need to be made aware of these to make sure that the firewall rules will allow traffic to and from these port ranges on all of the SQL Server 2016 instances in the PolyBase scale-out group.
In the Server Configuration dialog box, provide the credentials of the Active Directory domain user account that will run the PolyBase Engine and PolyBase Data Movement services.
Continue with the installation. Repeat this process on all of the SQL Server 2016 instances that you would like to join to your PolyBase scale-out group.
Once PolyBase has been installed on all of the SQL Server 2016 instances that you would like to join to your PolyBase scale-out group, you need to choose which of the servers will act as a head node. In my environment, I have three (3) servers running SQL Server 2016 - PolyBase-Head, PolyBase-CP1 and PolyBase-CP2 - all running default instances. I will use the PolyBase-Head instance as my head/control node.
But before we configure the PolyBase scale-out group, we first need to do the following on all of the SQL Server 2016 instances:
- Verify that PolyBase has been successfully installed. You can check by running the query below. A return value of 1 means that PolyBase has been installed properly.
- Configure Hadoop connectivity. Configuring PolyBase connectivity to Hadoop does not necessarily mean you are only connecting to a Hadoop cluster. Since Hadoop is basically a distributed file system with layers of abstraction to access and perform work on the files, anything that presents itself as a distributed file system can be used. It's one of the reasons why Azure blob storage and Amazon S3 can be used as file systems for Hadoop - both of which are fully supported in open source Apache Hadoop distribution. We'll configure PolyBase to connect to a Cloudera 5.1, 5.2, 5.3, 5.4, 5.5, or 5.9 Hadoop cluster running on Linux.
EXEC sp_configure 'hadoop connectivity', 6; GO RECONFIGURE;
- A list of values for the different supported external data sources are provided in this MSDN article PolyBase Connectivity Configuration.
- Restart the SQL Server service. We need to restart the SQL Server database engine service for the changes to take effect. Doing so also restarts the SQL Server PolyBase Engine and SQL Server PolyBase Data Movement services. If your SQL Server instances are used for anything other than PolyBase, be sure to schedule a maintenance window to perform these tasks.
- Validate that the changes have taken effect. After restarting the SQL Server services, validate that the changes made have taken effect. Use the T-SQL command below to verify that the hadoop_connectivity configuration value is set to 6.
EXEC sp_configure 'hadoop connectivity'
Because we haven't done anything specific to configure which of the SQL Server 2016 instances will act as the head node or the compute node, all of them can function as a PolyBase head and compute node. You can verify this by expanding the PolyBase -> Scale-out Group folder for all of the SQL Server instances using SQL Server Management Studio.
And since we would like to use PolyBase-Head as the head node and PolyBase-CP1 and PolyBase-CP2 as compute nodes, we will configure them accordingly.
- Run the sp_polybase_join_group system stored procedure on the compute nodes - PolyBase-CP1 and PolyBase-CP2 - passing the following:
- Server Name - PolyBase-Head - of the SQL Server instance that will run as the head node of the PolyBase scale-out group
- Port Number where the control channel for the head node PolyBase Data Movement Service is running; the default value is 16450
- Instance Name - MSSQLSERVER - of the SQL Server machine - PolyBase-Head - that will run as the head node of the PolyBase scale-out group
EXEC sp_polybase_join_group N'PolyBase-Head', 16450, N'MSSQLSERVER'
Alternatively, you can use SQL Server Management Studio to configure the PolyBase scale-out group. From the chosen head node - PolyBase-Head - right-click on the PolyBase folder and select Configure PolyBase cluster.
This will open up the Configure PolyBase Scale-out Group dialog box. Click the plus (+) sign button.
This will prompt you to connect to the SQL Server instances that you want to add to the PolyBase scale-out group.
Select the SQL Server instance name and click the left arrow button to add it to the PolyBase scale-out cluster instances: list.
When all of the SQL Server instances have been added to the list, click OK. This will create the PolyBase scale-out group. Observe their corresponding roles.
Only the head node - PolyBase-Head - should be running the SQL Server PolyBase Engine service. The SQL Server PolyBase Data Movement Service can run on all of the nodes - including the head node.
SELECT * FROM sys.dm_exec_compute_nodes;
You can also expand the PolyBase -> Scale-out Group folder to verify the configuration.
In this tip, we've configured a PolyBase scale-out group using three (3) SQL Server 2016 instances. We can now start configuring this PolyBase scale-out group to run queries against a multi-node Cloudera Hadoop cluster.
- Read more on the following topics:
- PolyBase scale-out groups
- PolyBase Connectivity Configuration (Transact-SQL)
- sp_polybase_join_group system stored procedure
- Review some of the Big Data tips to get familiarized with how Hadoop works so we can properly integrate it with PolyBase
Last Update: 2017-03-09
About the author
View all my tips