Load SQL Server T-SQL Query Results to Hadoop Using Sqoop



Problem

I need to load the results from a SQL Server T-SQL query to the Hadoop Distributed File System. How can I do this? How can I validate the data was loaded correctly?

Solution

Apache Sqoop's query argument allows you to pass in a SQL Server T-SQL query, so you can select specific columns and rows to be loaded into the Hadoop Distributed File System (HDFS). You can also join tables and use T-SQL functions for formatting. In this example, we will join the Sales.SalesOrderHeader and Sales.SalesOrderDetail tables in the SQL Server 2014 AdventureWorks2014 database to return the SalesOrderDetailID column, the OrderDate column formatted as YYYYMMDD, and the OrderQty and LineTotal columns where the LineTotal is 100.00 or greater. As with previous tips in this series, this tip is written using SQL Server 2014 and a Cloudera virtual machine running on a laptop.

The Sqoop command we will examine is listed below, and we will break down each argument in the following bullet points. Please note that the command must be entered either on one complete line or with a backslash (the Linux command line continuation character) at the end of each line except the last.

sqoop import \
--connect 'jdbc:sqlserver://aaa.bbb.ccc.ddd:pppp;databasename=AdventureWorks2014' \
--username 'sqoopLogin' -P \
--target-dir 'Sales100OrGreater' \
--split-by 'SalesOrderDetailID' \
--query "select d.SalesOrderDetailID, CONVERT(nvarchar(8), h.OrderDate, 112) as OrderDate, 
    d.OrderQty, d.LineTotal 
    from Sales.SalesOrderDetail as d 
    inner join Sales.SalesOrderHeader as h 
    on d.SalesOrderID = h.SalesOrderID 
    where d.LineTotal>=100.00 and \$CONDITIONS"
  • sqoop import - The executable is named sqoop, and we are instructing it to import data from a database to HDFS; in this case the data comes from a query rather than a table or view.
  • --connect - With the --connect argument, we are passing in the JDBC connection string for SQL Server. In this case, we use the IP address, port number, and database name.
  • --username - In this example, the user name is a SQL Server login, not a Windows login. Our SQL Server instance is configured for mixed mode authentication. We have a server login named sqoopLogin, and we have a database user named sqoopUser which is a member of the db_datareader role and has a default schema of dbo.
  • -P - This will prompt the command line user for the password. If Sqoop is rarely executed, this might be a good option. There are multiple other ways to pass the password to the command automatically (one alternative is sketched after this list), but we are trying to keep it simple for this tip.
  • --target-dir - The target directory tells Sqoop the directory on the HDFS where the selected data will be stored. Sqoop requires this argument when importing with a free-form query.
  • --split-by - The --split-by argument names the column Sqoop uses to divide the rows among its parallel map tasks. Even though we are selecting the primary key of the SalesOrderDetail table, we still have to tell Sqoop explicitly which unique identifier to use to distribute the workload.
  • --query - This is the argument in which we supply our T-SQL query, enclosed in double quotes. Also notice there is no backslash (the line continuation character) in the multiple lines containing the query. Finally, notice the "and \$CONDITIONS" at the end of the WHERE clause. Sqoop requires this token because it replaces $CONDITIONS with a different condition for each data split; in the output below you can see it substituted as (1 = 0) while Sqoop generates code and as (1 = 1) in the bounding-values query. The backslash keeps the shell from expanding the token, so please do not define $CONDITIONS as an environment variable.
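As a minimal sketch of one such alternative, the password can be stored in a protected file and passed with Sqoop's --password-file argument instead of -P (supported in Sqoop 1.4.4 and later, so it should apply to the 1.4.5 release shown below). The file path here is hypothetical, and the file must contain only the password itself:

# Create a password file readable only by the job's user (hypothetical path).
# echo -n avoids a trailing newline, which Sqoop would treat as part of the password.
echo -n 'sqoopLoginPassword' > /home/hdfs/.sqoop.pw
chmod 400 /home/hdfs/.sqoop.pw

# Reference it from the import command; the file:// scheme points Sqoop at the
# local file system rather than the default HDFS file system.
sqoop import --connect 'jdbc:sqlserver://aaa.bbb.ccc.ddd:pppp;databasename=AdventureWorks2014' \
             --username 'sqoopLogin' \
             --password-file 'file:///home/hdfs/.sqoop.pw' \
             ...   # remaining arguments as in the command above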

The output from the sqoop command, saved in a shell script named sqoopCommand.sh and executed as the hdfs user, is shown below.

[hdfs@localhost ~]$ ./sqoopCommand.sh 
16/08/23 08:07:49 INFO sqoop.Sqoop: Running Sqoop version: 1.4.5-cdh5.2.0
Enter password: 
16/08/23 08:07:53 INFO manager.SqlManager: Using default fetchSize of 1000
16/08/23 08:07:53 INFO tool.CodeGenTool: Beginning code generation
16/08/23 08:07:54 INFO manager.SqlManager: Executing SQL statement: select d.SalesOrderDetailID, CONVERT(nvarchar(8), h.OrderDate, 112) as OrderDate, d.OrderQty, d.LineTotal from Sales.SalesOrderDetail as d inner join Sales.SalesOrderHeader as h on d.SalesOrderID = h.SalesOrderID where d.LineTotal>=100.00 and  (1 = 0) 
16/08/23 08:07:55 INFO manager.SqlManager: Executing SQL statement: select d.SalesOrderDetailID, CONVERT(nvarchar(8), h.OrderDate, 112) as OrderDate, d.OrderQty, d.LineTotal from Sales.SalesOrderDetail as d inner join Sales.SalesOrderHeader as h on d.SalesOrderID = h.SalesOrderID where d.LineTotal>=100.00 and  (1 = 0) 
16/08/23 08:07:55 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/lib/hadoop-0.20-mapreduce
Note: /tmp/sqoop-training/compile/822e63bc3c3869e0c10ccacb69821dc5/QueryResult.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
16/08/23 08:07:59 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-training/compile/822e63bc3c3869e0c10ccacb69821dc5/QueryResult.jar
16/08/23 08:07:59 INFO mapreduce.ImportJobBase: Beginning query import.
16/08/23 08:08:02 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
16/08/23 08:08:08 INFO db.DBInputFormat: Using read commited transaction isolation
16/08/23 08:08:08 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(SalesOrderDetailID), MAX(SalesOrderDetailID) FROM (select d.SalesOrderDetailID, CONVERT(nvarchar(8), h.OrderDate, 112) as OrderDate, d.OrderQty, d.LineTotal from Sales.SalesOrderDetail as d inner join Sales.SalesOrderHeader as h on d.SalesOrderID = h.SalesOrderID where d.LineTotal>=100.00 and  (1 = 1) ) AS t1
16/08/23 08:08:09 INFO mapred.JobClient: Running job: job_201608230640_0002
16/08/23 08:08:10 INFO mapred.JobClient:  map 0% reduce 0%
16/08/23 08:08:37 INFO mapred.JobClient:  map 50% reduce 0%
16/08/23 08:08:51 INFO mapred.JobClient:  map 75% reduce 0%
16/08/23 08:08:52 INFO mapred.JobClient:  map 100% reduce 0%
16/08/23 08:08:55 INFO mapred.JobClient: Job complete: job_201608230640_0002
16/08/23 08:08:55 INFO mapred.JobClient: Counters: 23
16/08/23 08:08:55 INFO mapred.JobClient:   File System Counters
16/08/23 08:08:55 INFO mapred.JobClient:     FILE: Number of bytes read=0
16/08/23 08:08:55 INFO mapred.JobClient:     FILE: Number of bytes written=1190836
16/08/23 08:08:55 INFO mapred.JobClient:     FILE: Number of read operations=0
16/08/23 08:08:55 INFO mapred.JobClient:     FILE: Number of large read operations=0
16/08/23 08:08:55 INFO mapred.JobClient:     FILE: Number of write operations=0
16/08/23 08:08:55 INFO mapred.JobClient:     HDFS: Number of bytes read=534
16/08/23 08:08:55 INFO mapred.JobClient:     HDFS: Number of bytes written=1814075
16/08/23 08:08:55 INFO mapred.JobClient:     HDFS: Number of read operations=12
16/08/23 08:08:55 INFO mapred.JobClient:     HDFS: Number of large read operations=0
16/08/23 08:08:55 INFO mapred.JobClient:     HDFS: Number of write operations=6
16/08/23 08:08:55 INFO mapred.JobClient:   Job Counters 
16/08/23 08:08:55 INFO mapred.JobClient:     Launched map tasks=4
16/08/23 08:08:55 INFO mapred.JobClient:     Total time spent by all maps in occupied slots (ms)=69593
16/08/23 08:08:55 INFO mapred.JobClient:     Total time spent by all reduces in occupied slots (ms)=0
16/08/23 08:08:55 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
16/08/23 08:08:55 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
16/08/23 08:08:55 INFO mapred.JobClient:   Map-Reduce Framework
16/08/23 08:08:55 INFO mapred.JobClient:     Map input records=63605
16/08/23 08:08:55 INFO mapred.JobClient:     Map output records=63605
16/08/23 08:08:55 INFO mapred.JobClient:     Input split bytes=534
16/08/23 08:08:55 INFO mapred.JobClient:     Spilled Records=0
16/08/23 08:08:55 INFO mapred.JobClient:     CPU time spent (ms)=11250
16/08/23 08:08:55 INFO mapred.JobClient:     Physical memory (bytes) snapshot=451989504
16/08/23 08:08:55 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=2954272768
16/08/23 08:08:55 INFO mapred.JobClient:     Total committed heap usage (bytes)=127401984
16/08/23 08:08:55 INFO mapreduce.ImportJobBase: Transferred 1.73 MB in 55.1263 seconds (32.1363 KB/sec)
16/08/23 08:08:55 INFO mapreduce.ImportJobBase: Retrieved 63605 records.
[hdfs@localhost ~]$

Notice the last line of the output above shows that 63,605 records were retrieved. This corresponds to the 63,605 records returned by the T-SQL query, as we see in the image below, along with the first 5 records returned by the query.

[Image: T-SQL query row count and top 5 rows]
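For reference, a T-SQL script along the following lines (a sketch to run in SQL Server Management Studio against AdventureWorks2014) reproduces the row count and the preview shown in the image:

-- Count the rows the Sqoop query should return; this should match the
-- "Retrieved 63605 records" line in the Sqoop output above.
select COUNT(*) as RecordCount
from Sales.SalesOrderDetail as d
inner join Sales.SalesOrderHeader as h
on d.SalesOrderID = h.SalesOrderID
where d.LineTotal >= 100.00;

-- Preview the first 5 rows, formatted as they will appear on the HDFS.
select top 5 d.SalesOrderDetailID,
       CONVERT(nvarchar(8), h.OrderDate, 112) as OrderDate,
       d.OrderQty, d.LineTotal
from Sales.SalesOrderDetail as d
inner join Sales.SalesOrderHeader as h
on d.SalesOrderID = h.SalesOrderID
where d.LineTotal >= 100.00
order by d.SalesOrderDetailID;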

After executing the sqoop command, we can execute the hdfs dfs -ls command below to see the Sales100OrGreater directory that was created on the HDFS with the name we supplied in the --target-dir argument. We also see the DimCustomer directory from a previous tip.

[hdfs@localhost ~]$ hdfs dfs -ls
Found 2 items
drwxrwxrwx   - training supergroup          0 2016-08-16 21:55 DimCustomer
drwxrwxrwx   - training supergroup          0 2016-08-23 08:08 Sales100OrGreater
[hdfs@localhost ~]$ 

We can use the hdfs dfs -ls command below to list the contents of the Sales100OrGreater directory. Notice how 4 partitions (the part-m files) were created by default.

[hdfs@localhost ~]$ hdfs dfs -ls Sales100OrGreater
Found 6 items
-rw-rw-rw-   1 training supergroup          0 2016-08-23 08:08 Sales100OrGreater/_SUCCESS
drwxrwxrwx   - training supergroup          0 2016-08-23 08:08 Sales100OrGreater/_logs
-rw-rw-rw-   1 training supergroup     717914 2016-08-23 08:08 Sales100OrGreater/part-m-00000
-rw-rw-rw-   1 training supergroup     447516 2016-08-23 08:08 Sales100OrGreater/part-m-00001
-rw-rw-rw-   1 training supergroup     353060 2016-08-23 08:08 Sales100OrGreater/part-m-00002
-rw-rw-rw-   1 training supergroup     295585 2016-08-23 08:08 Sales100OrGreater/part-m-00003
[hdfs@localhost ~]$
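The four part-m files correspond to Sqoop's default of four parallel map tasks. If a different number of output files is desired, the -m (or --num-mappers) argument can be added to the import command. A sketch, keeping the other arguments exactly as above:

sqoop import --connect 'jdbc:sqlserver://aaa.bbb.ccc.ddd:pppp;databasename=AdventureWorks2014' \
             --username 'sqoopLogin' -P \
             --target-dir 'Sales100OrGreater' \
             --split-by 'SalesOrderDetailID' \
             -m 8 \
             --query "..."   # the same query as above, ending in and \$CONDITIONS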

The hdfs dfs -cat command below, piped through head, will display the first 10 records in the first partition on the HDFS. We can use this command to verify the correct records were loaded.

[hdfs@localhost ~]$ hdfs dfs -cat Sales100OrGreater/part-m-00000|head
1,20110531,1,2024.994000
2,20110531,3,6074.982000
3,20110531,1,2024.994000
4,20110531,1,2039.994000
5,20110531,1,2039.994000
6,20110531,2,4079.988000
7,20110531,1,2039.994000
13,20110531,1,419.458900
14,20110531,1,874.794000
15,20110531,1,809.760000
[hdfs@localhost ~]$
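To finish validating the load, we can concatenate all of the partitions on the HDFS and count the lines; the total should match the 63,605 records reported by both Sqoop and the T-SQL query above:

# Count the records across all four part-m files in the target directory.
hdfs dfs -cat Sales100OrGreater/part-m-* | wc -l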
Next Steps








About the author
Dr. Dallas Snider is an Assistant Professor in the Computer Science Department at the University of West Florida and has 18+ years of SQL experience.
