Load SQL Server T-SQL Query Results to Hadoop Using Sqoop



Problem

I need to load the results from a SQL Server T-SQL query to the Hadoop Distributed File System. How can I do this? How can I validate the data was loaded correctly?

Solution

Apache Sqoop's query argument allows you to pass in a SQL Server T-SQL query, so you can select specific columns and rows to be loaded into the Hadoop Distributed File System (HDFS). You can also join tables and use T-SQL functions for formatting. In this example, we will join the Sales.SalesOrderHeader and Sales.SalesOrderDetail tables in the SQL Server 2014 AdventureWorks2014 database to return the SalesOrderDetailID column, the OrderDate column formatted as YYYYMMDD, and the OrderQty and LineTotal columns where the LineTotal is 100.00 or greater. As with previous tips in this series, this tip is written using SQL Server 2014 and a Cloudera virtual machine running on a laptop.

The Sqoop command we will examine is listed below, and we will break down each argument in the following bullet points. Please note that the command must be entered either on one complete line or with a backslash (the Linux command line continuation character) at the end of each line except the last.

sqoop import \
--connect 'jdbc:sqlserver://aaa.bbb.ccc.ddd:pppp;databasename=AdventureWorks2014' \
--username 'sqoopLogin' -P \
--target-dir 'Sales100OrGreater' \
--split-by 'SalesOrderDetailID' \
--query "select d.SalesOrderDetailID, CONVERT(nvarchar(8), h.OrderDate, 112) as OrderDate, 
    d.OrderQty, d.LineTotal 
    from Sales.SalesOrderDetail as d 
    inner join Sales.SalesOrderHeader as h 
    on d.SalesOrderID = h.SalesOrderID 
    where d.LineTotal>=100.00 and \$CONDITIONS"
  • sqoop import - The executable is named sqoop, and we are instructing it to import data from a database to HDFS; in this case the data comes from a query rather than a table or view.
  • --connect - With the --connect argument, we are passing in the JDBC connection string for SQL Server. In this case, we use the IP address, port number, and database name.
  • --username - In this example, the user name is a SQL Server login, not a Windows login. Our SQL Server instance is configured for mixed mode authentication. We have a server login named sqoopLogin, and we have a database user named sqoopUser which is a member of the db_datareader role and has a default schema of dbo.
  • -P - This will prompt the command line user for the password. If Sqoop is rarely executed, this might be a good option. There are multiple other ways to pass the password to the command automatically (one alternative is sketched after this list), but we are trying to keep it simple for this tip.
  • --target-dir - The target directory tells Sqoop the directory on the HDFS where the selected data will be stored. Sqoop requires this argument when importing with a free-form query.
  • --split-by - The --split-by argument names the column Sqoop uses to divide the rows among its parallel map tasks. Even though we are selecting the primary key of the SalesOrderDetail table, we still have to tell Sqoop explicitly which unique identifier to use to distribute the workload.
  • --query - This is the argument in which we supply our T-SQL query, enclosed in double quotes. Also notice there is no backslash (the line continuation character) in the multiple lines containing the query. Finally, notice the "and \$CONDITIONS" at the end of the WHERE clause. Sqoop requires this token because it replaces $CONDITIONS with a different condition for each data split; in the output below you can see it substituted as (1 = 0) while Sqoop generates code and as (1 = 1) in the bounding-values query. The backslash keeps the shell from expanding the token, so please do not define $CONDITIONS as an environment variable.
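As a minimal sketch of one such alternative, the password can be stored in a protected file and passed with Sqoop's --password-file argument instead of -P (supported in Sqoop 1.4.4 and later, so it should apply to the 1.4.5 release shown below). The file path here is hypothetical, and the file must contain only the password itself:

# Create a password file readable only by the job's user (hypothetical path).
# echo -n avoids a trailing newline, which Sqoop would treat as part of the password.
echo -n 'sqoopLoginPassword' > /home/hdfs/.sqoop.pw
chmod 400 /home/hdfs/.sqoop.pw

# Reference it from the import command; the file:// scheme points Sqoop at the
# local file system rather than the default HDFS file system.
sqoop import --connect 'jdbc:sqlserver://aaa.bbb.ccc.ddd:pppp;databasename=AdventureWorks2014' \
             --username 'sqoopLogin' \
             --password-file 'file:///home/hdfs/.sqoop.pw' \
             ...   # remaining arguments as in the command above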

The output from the sqoop command, saved in a shell script named sqoopCommand.sh and executed as the hdfs user, is shown below.

[hdfs@localhost ~]$ ./sqoopCommand.sh 
16/08/23 08:07:49 INFO sqoop.Sqoop: Running Sqoop version: 1.4.5-cdh5.2.0
Enter password: 
16/08/23 08:07:53 INFO manager.SqlManager: Using default fetchSize of 1000
16/08/23 08:07:53 INFO tool.CodeGenTool: Beginning code generation
16/08/23 08:07:54 INFO manager.SqlManager: Executing SQL statement: select d.SalesOrderDetailID, CONVERT(nvarchar(8), h.OrderDate, 112) as OrderDate, d.OrderQty, d.LineTotal from Sales.SalesOrderDetail as d inner join Sales.SalesOrderHeader as h on d.SalesOrderID = h.SalesOrderID where d.LineTotal>=100.00 and  (1 = 0) 
16/08/23 08:07:55 INFO manager.SqlManager: Executing SQL statement: select d.SalesOrderDetailID, CONVERT(nvarchar(8), h.OrderDate, 112) as OrderDate, d.OrderQty, d.LineTotal from Sales.SalesOrderDetail as d inner join Sales.SalesOrderHeader as h on d.SalesOrderID = h.SalesOrderID where d.LineTotal>=100.00 and  (1 = 0) 
16/08/23 08:07:55 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/lib/hadoop-0.20-mapreduce
Note: /tmp/sqoop-training/compile/822e63bc3c3869e0c10ccacb69821dc5/QueryResult.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
16/08/23 08:07:59 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-training/compile/822e63bc3c3869e0c10ccacb69821dc5/QueryResult.jar
16/08/23 08:07:59 INFO mapreduce.ImportJobBase: Beginning query import.
16/08/23 08:08:02 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
16/08/23 08:08:08 INFO db.DBInputFormat: Using read commited transaction isolation
16/08/23 08:08:08 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(SalesOrderDetailID), MAX(SalesOrderDetailID) FROM (select d.SalesOrderDetailID, CONVERT(nvarchar(8), h.OrderDate, 112) as OrderDate, d.OrderQty, d.LineTotal from Sales.SalesOrderDetail as d inner join Sales.SalesOrderHeader as h on d.SalesOrderID = h.SalesOrderID where d.LineTotal>=100.00 and  (1 = 1) ) AS t1
16/08/23 08:08:09 INFO mapred.JobClient: Running job: job_201608230640_0002
16/08/23 08:08:10 INFO mapred.JobClient:  map 0% reduce 0%
16/08/23 08:08:37 INFO mapred.JobClient:  map 50% reduce 0%
16/08/23 08:08:51 INFO mapred.JobClient:  map 75% reduce 0%
16/08/23 08:08:52 INFO mapred.JobClient:  map 100% reduce 0%
16/08/23 08:08:55 INFO mapred.JobClient: Job complete: job_201608230640_0002
16/08/23 08:08:55 INFO mapred.JobClient: Counters: 23
16/08/23 08:08:55 INFO mapred.JobClient:   File System Counters
16/08/23 08:08:55 INFO mapred.JobClient:     FILE: Number of bytes read=0
16/08/23 08:08:55 INFO mapred.JobClient:     FILE: Number of bytes written=1190836
16/08/23 08:08:55 INFO mapred.JobClient:     FILE: Number of read operations=0
16/08/23 08:08:55 INFO mapred.JobClient:     FILE: Number of large read operations=0
16/08/23 08:08:55 INFO mapred.JobClient:     FILE: Number of write operations=0
16/08/23 08:08:55 INFO mapred.JobClient:     HDFS: Number of bytes read=534
16/08/23 08:08:55 INFO mapred.JobClient:     HDFS: Number of bytes written=1814075
16/08/23 08:08:55 INFO mapred.JobClient:     HDFS: Number of read operations=12
16/08/23 08:08:55 INFO mapred.JobClient:     HDFS: Number of large read operations=0
16/08/23 08:08:55 INFO mapred.JobClient:     HDFS: Number of write operations=6
16/08/23 08:08:55 INFO mapred.JobClient:   Job Counters 
16/08/23 08:08:55 INFO mapred.JobClient:     Launched map tasks=4
16/08/23 08:08:55 INFO mapred.JobClient:     Total time spent by all maps in occupied slots (ms)=69593
16/08/23 08:08:55 INFO mapred.JobClient:     Total time spent by all reduces in occupied slots (ms)=0
16/08/23 08:08:55 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
16/08/23 08:08:55 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
16/08/23 08:08:55 INFO mapred.JobClient:   Map-Reduce Framework
16/08/23 08:08:55 INFO mapred.JobClient:     Map input records=63605
16/08/23 08:08:55 INFO mapred.JobClient:     Map output records=63605
16/08/23 08:08:55 INFO mapred.JobClient:     Input split bytes=534
16/08/23 08:08:55 INFO mapred.JobClient:     Spilled Records=0
16/08/23 08:08:55 INFO mapred.JobClient:     CPU time spent (ms)=11250
16/08/23 08:08:55 INFO mapred.JobClient:     Physical memory (bytes) snapshot=451989504
16/08/23 08:08:55 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=2954272768
16/08/23 08:08:55 INFO mapred.JobClient:     Total committed heap usage (bytes)=127401984
16/08/23 08:08:55 INFO mapreduce.ImportJobBase: Transferred 1.73 MB in 55.1263 seconds (32.1363 KB/sec)
16/08/23 08:08:55 INFO mapreduce.ImportJobBase: Retrieved 63605 records.
[hdfs@localhost ~]$

Notice the last line of the output above shows that 63,605 records were retrieved. This corresponds to the 63,605 records returned by the T-SQL query, as we see in the image below, along with the first 5 records returned by the query.

[Image: T-SQL query row count and top 5 rows]
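For reference, a T-SQL script along the following lines (a sketch to run in SQL Server Management Studio against AdventureWorks2014) reproduces the row count and the preview shown in the image:

-- Count the rows the Sqoop query should return; this should match the
-- "Retrieved 63605 records" line in the Sqoop output above.
select COUNT(*) as RecordCount
from Sales.SalesOrderDetail as d
inner join Sales.SalesOrderHeader as h
on d.SalesOrderID = h.SalesOrderID
where d.LineTotal >= 100.00;

-- Preview the first 5 rows, formatted as they will appear on the HDFS.
select top 5 d.SalesOrderDetailID,
       CONVERT(nvarchar(8), h.OrderDate, 112) as OrderDate,
       d.OrderQty, d.LineTotal
from Sales.SalesOrderDetail as d
inner join Sales.SalesOrderHeader as h
on d.SalesOrderID = h.SalesOrderID
where d.LineTotal >= 100.00
order by d.SalesOrderDetailID;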

After executing the sqoop command, we can execute the hdfs dfs -ls command below to see the Sales100OrGreater directory that was created on the HDFS with the name we supplied in the --target-dir argument. We also see the DimCustomer directory from a previous tip.

[hdfs@localhost ~]$ hdfs dfs -ls
Found 2 items
drwxrwxrwx   - training supergroup          0 2016-08-16 21:55 DimCustomer
drwxrwxrwx   - training supergroup          0 2016-08-23 08:08 Sales100OrGreater
[hdfs@localhost ~]$ 

We can use the hdfs dfs -ls command below to list the contents of the Sales100OrGreater directory. Notice how 4 partitions (the part-m files) were created by default.

[hdfs@localhost ~]$ hdfs dfs -ls Sales100OrGreater
Found 6 items
-rw-rw-rw-   1 training supergroup          0 2016-08-23 08:08 Sales100OrGreater/_SUCCESS
drwxrwxrwx   - training supergroup          0 2016-08-23 08:08 Sales100OrGreater/_logs
-rw-rw-rw-   1 training supergroup     717914 2016-08-23 08:08 Sales100OrGreater/part-m-00000
-rw-rw-rw-   1 training supergroup     447516 2016-08-23 08:08 Sales100OrGreater/part-m-00001
-rw-rw-rw-   1 training supergroup     353060 2016-08-23 08:08 Sales100OrGreater/part-m-00002
-rw-rw-rw-   1 training supergroup     295585 2016-08-23 08:08 Sales100OrGreater/part-m-00003
[hdfs@localhost ~]$
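The four part-m files correspond to Sqoop's default of four parallel map tasks. If a different number of output files is desired, the -m (or --num-mappers) argument can be added to the import command. A sketch, keeping the other arguments exactly as above:

sqoop import --connect 'jdbc:sqlserver://aaa.bbb.ccc.ddd:pppp;databasename=AdventureWorks2014' \
             --username 'sqoopLogin' -P \
             --target-dir 'Sales100OrGreater' \
             --split-by 'SalesOrderDetailID' \
             -m 8 \
             --query "..."   # the same query as above, ending in and \$CONDITIONS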

The hdfs dfs -cat command below, piped through head, will display the first 10 records in the first partition on the HDFS. We can use this command to verify the correct records were loaded.

[hdfs@localhost ~]$ hdfs dfs -cat Sales100OrGreater/part-m-00000|head
1,20110531,1,2024.994000
2,20110531,3,6074.982000
3,20110531,1,2024.994000
4,20110531,1,2039.994000
5,20110531,1,2039.994000
6,20110531,2,4079.988000
7,20110531,1,2039.994000
13,20110531,1,419.458900
14,20110531,1,874.794000
15,20110531,1,809.760000
[hdfs@localhost ~]$
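To finish validating the load, we can concatenate all of the partitions on the HDFS and count the lines; the total should match the 63,605 records reported by both Sqoop and the T-SQL query above:

# Count the records across all four part-m files in the target directory.
hdfs dfs -cat Sales100OrGreater/part-m-* | wc -l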
Next Steps








About the author
Dr. Dallas Snider is an Assistant Professor in the Computer Science Department at the University of West Florida and has 18+ years of SQL experience.
