Create a SQL Server PolyBase Scale-out Group Using Amazon Web Services
PolyBase is a SQL Server feature that allows you to retrieve data from remote sources, no matter their type, using only T-SQL. If you're familiar with SQL Server, it has benefits over linked servers in that PolyBase isn't single-threaded and you can use it with SQL Server on Linux. If you're familiar with Oracle, it has benefits compared to heterogeneous services because you don't need to install external components: by default, you have access to SQL Server, Hadoop, Azure Storage, Oracle, Teradata, CosmosDB, and MongoDB. Note that PolyBase is not meant for data migrations, it helps with real-time reporting by simplifying ETL processes.
One of the most impressive features in SQL Server 2019 related to PolyBase is the ability to create scale-out groups, pairing it with its predecessors Parallel Data Warehouse (PDW) and Analytics Platform System (APS) in terms of computation nodes. A scale-out group is a farm of SQL Server instances that ingest data from an external source in parallel; the best part is they don't need to be clustered nor need to be the Enterprise version.
Let's look at how to build a scale-out group for SQL Server PolyBase.
One of the prerequisites is having the machines in the same domain with a domain account, so you can't create this configuration using Docker. Also, you can't install the required features in a Windows Server image and your Windows host prevents you from opening the required ports for Samba.
Let's see how we can build this solution.
AWS Proof of Concept Program
In AWS, setting up a Docker container with SQL Server 2019 on Windows is free for a test environment, but one thing missing is the creation of a scale-out group in AWS which isn't free. If you don't have money to pay for this service, don't worry, AWS helps you in many ways to try their solutions, one being the Proof of Concept program.
Create a Scale-out Group on AWS
To create the scale-out group, log in to https://aws.amazon.com/, and on the upper right corner under "My Account," click on "AWS Management Console." You can either search for the service "Directory Service" or click in the upper left corner under "Services" then "Security, Identity & Compliance" and finally "Directory Service". On the "Directories" page click on "Set up directory" and there you need to choose the directory type.
You need to select "AWS Managed Microsoft AD," because the second option is to redirect requests to your on-premises directory, and the third option is for application identity sign-in. Click "Next" and you need to enter the Directory information.
- Edition: Standard for small-medium sized businesses and Enterprise for large businesses. For testing it doesn't matter which one you choose.
- Directory DNS name: A fully qualified domain name, minimum two words.
- Directory NetBIOS name: Optional.
- Admin password: Take note of this password of the default user named "Admin".
After you click "Next" you need to enter the VPC and subnets, or you can leave the defaults.
After you click "Next" you need to review the information. If everything looks correct, click on "Create directory". Note that the first time you create a directory it goes through a validation process; during this process, you can't make use of the directory. In my case it took about five hours.
After the directory is created and validated, you need to create a pair of EC2 instances (virtual machines), so either search for the service "EC2" or from the upper left corner under "Services" choose "Compute" and then "EC2"; but note, in this case, you will see a dashboard of your instances, and you need to click on "Launch instance" and then again on "Launch instance". On the next screen, you need to select the base image for your EC2s, so search "Windows" and note at this time there are 19 images available; avoid selecting the ones that already come with SQL Server installed as we will create it from scratch.
Once you select it, you need to choose the EC2 size appropriate for PolyBase: you need a minimum 2 GB RAM so an appropriate type is t3.small which has 2 vCPUs and 2 GB RAM, then click on "Next: Configure Instance Details". In the next screen there are so many options to set up your infrastructure and to reduce your costs, but you must provide at least the following information and click "Review and Launch".
- Number of instances: 2
- Domain join directory: select the directory created earlier
- IAM role: click on "Create new IAM role" and follow the instructions in step 6.e from here
Once the instances are launched you need to enable the Active Directory tools to create a domain user. To do this, RDP into one of them with the user "Admin" and the password you specified when creating the domain, then go to Start > Server Manager > Add roles and features > Next > Role-based or feature-based installation > Next > Next > Next > choose "Group Policy Management" and "Remote Server Administration Tools"-"Role administration tools"-"AD DS and AD LDS tools" and "DNS server tools" > Next > Install > Close. e.
Then go to Start > Windows Administrative Tools > Active Directory Users and Computers > select your domain > Users > New > User.
To install SQL Server 2019, log in to both EC2s through RDP and there you can copy the SQL Server trial executable file which you can download from here named "SQL2019-SSEI-Eval.exe". When executed, this program helps you download SQL Server 2019 Evaluation. Then it's easy to mount the downloaded ISO file and install SQL Server from the command line as follows (the following assumes it was mounted in the D: drive).
& "d:\setup.exe" /Q /Action=Install /IAcceptSQLServerLicenseTerms /IndicateProgress /Features=SQLEngine,Conn,PolyBase /InstanceName=MSSQLSERVER /TcpEnabled=1 /SecurityMode=SQL /SaPwd=YourPwd /SqlSysAdminAccounts=pabechevb\svcacct /PBScaleOut /PBEngSvcAccount=pabechevb\svcacct /PBEngSvcPassword=YourPwd /PBDMSSvcAccount=pabechevb\svcacct /PBDMSSvcPassword=YourPwd
In the command above, you can add more features, choose your instance name, enter your sysadmin password, your sysadmin account, your PolyBase engine service account and password, and your PolyBase Data Movement Service (DMS) account and password.
You can also install SQL Server from the GUI, double click "setup.exe" and when prompted, choose "Use this SQL Server as a part of PolyBase scale-out group" and specify the service account for PolyBase Engine and DMS service.
Once SQL Server is installed, you can use 'sqlcmd' to finish the setup. Run the following command and replace 'YourPwd' with the appropriate password to connect to the local instance of SQL Server.
sqlcmd -Usa -PYourPwd
Then run the following commands in all nodes to enable PolyBase (no restart needed).
EXEC sp_configure @configname = 'polybase enabled', @configvalue = 1 GO RECONFIGURE GO
To mark a node as a compute node (secondary), you need to run the following command in the secondary instances. Take note 'node1' is the name of the first EC2 instance, the port used for communication is 16450, and the SQL Server instance name in 'node1' is: ‘MSSQLSERVER'.
EXEC sp_polybase_join_group 'node1', 16450, 'MSSQLSERVER' GO
As indicated when you run the above command, you need to stop the PolyBase Engine service and restart the PolyBase DMS service. Below is the output of the commands run in 'node2'.
Below are the services in 'node2'.
From the head node ('node1'), you can run the following command to view the scale-out group.
select * from sys.dm_exec_compute_nodes GO
Below is the output of the commands run in 'node1'.
You can find additional information about PolyBase scale-out groups in the following Microsoft resources:
When you have this configuration running, you can follow several of the examples described in my book to create for free any other data source and consume its information including: SQL Server, Hadoop, Azure storage, Spark, Oracle, Teradata, Cassandra, MongoDB, CosmosDB, MySQL, PostgreSQL, MariaDB, and SAP HANA.
Also remember instead of a pair of EC2 instances you can add as many instances as you like, and create them with a higher amount of CPUs and RAM, for testing high volumes of data and measuring the performance, to see if it satisfies your business needs, before deciding for the long term. Finally, don't forget to delete all your resources once you're done, in order to not get a surprise bill at the end of the month on your credit card.
About the author
View all my tips
Article Last Updated: 2021-11-19