Export from Hadoop File System to a SQL Server Database Table


By: Dallas Snider   |   Related Tips: More > Big Data

Problem

I need to export data from the Hadoop Distributed File System (HDFS) to a SQL Server database table. How can I do this?

Solution

Apache's Sqoop allows for importing data from a database such as SQL Server to the HDFS, and for exporting data from the HDFS to a database table. In this tip we will use Apache Sqoop's export functionality to move data stored on the HDFS to a SQL Server database table. As with previous tips in this series, this tip is written using SQL Server 2014 and a Cloudera virtual machine running on a laptop.
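For reference, the import direction (database table to HDFS) uses a very similar syntax. The sketch below is hedged: it reuses the placeholder IP address, port, and login that appear later in this tip, and assumes sqoop is on the PATH.

```shell
# Hedged sketch of the opposite direction (import), using the same
# placeholder connection details as the export command in this tip.
sqoop import \
  --connect 'jdbc:sqlserver://aaa.bbb.ccc.ddd:pppp;databasename=AdventureWorks2014' \
  --username 'sqoopLogin' -P \
  --table 'tblProductData' \
  --target-dir 'mssqlTips' \
  -- --schema dbo
```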

The contents of the file on the HDFS are shown below. There are three columns in this comma-separated value file. The first column is a unique integer identifier. The second column is a string used for a description. The third column is the unit cost. The destination table will have columns matching these three fields and their data types.

[hdfs@localhost:/mssqltips]$ hdfs dfs -cat mssqlTips/linuxDataFile.csv
1,Product A,1.01
2,Product B,2.02
3,Product C,3.03
4,Product D,4.04
5,Product E,5.05
6,Product F,6.06
7,Product G,7.07
8,Product H,8.08
9,Product I,9.09
10,Product J,10.10
11,Product K,11.11
12,Product L,12.12
13,Product M,13.13
14,Product N,14.14
15,Product O,15.15
16,Product P,16.16
17,Product Q,17.17
18,Product R,18.18
19,Product S,19.19
20,Product T,20.20
21,Product U,21.21
22,Product V,22.22
23,Product W,23.23
24,Product X,24.24
25,Product Y,25.25
26,Product Z,26.26
[hdfs@localhost:/mssqltips]$ 

The T-SQL below was used to create the destination table for this tip.

create table dbo.tblProductData
(
  ProductKey int not null PRIMARY KEY,    
  ProductName varchar(50),
  UnitCost money
)

The image below shows the destination table in the AdventureWorks2014 database as displayed in the Object Explorer.

Target table in the AdventureWorks2014 SQL Server database

The Sqoop command we will examine is listed below, and we will break down each argument in the following bullet points. Please note that the command must be entered on one complete line, or with a backslash (the Linux command-line continuation character) at the end of every line except the last.

sqoop export --connect 'jdbc:sqlserver://aaa.bbb.ccc.ddd:pppp;databasename=AdventureWorks2014' \
             --username 'sqoopLogin' -P \
             --export-dir 'mssqlTips' \
             --table 'tblProductData' \
             -- \
             --schema dbo
  • sqoop export - The executable is named sqoop and we are instructing it to export the data from an HDFS file to a database.
  • --connect - With the --connect argument, we are passing in the jdbc connect string for SQL Server. In this case, we use the IP address, port number, and database name.
  • --username - In this example, the user name is a SQL Server login, not a Windows login. Our database is set up to authenticate in mixed mode. We have a server login named sqoopLogin, and we have a database user named sqoopUser, which is a member of the db_datawriter role and has a default schema of dbo.
  • -P - This will prompt the command line user for the password. If Sqoop is rarely executed, this might be a good option. There are multiple other ways to automatically pass the password to the command, but we are trying to keep it simple for this tip.
  • --export-dir - The export directory tells sqoop in which directory the file is stored on the HDFS. This argument is required by Sqoop when using the export option.
  • --table - This argument provides sqoop with the destination table on the SQL Server database. This argument is also required by Sqoop when using the export option.
  • -- - The bare double dash is not a typographical error. It tells Sqoop that the arguments that follow are database specific.
  • --schema - This argument specifies the schema in which our table resides.
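The output below is produced by running ./sqoopExportCommand.sh, so the command above was saved in a small shell script. The tip does not show the script's contents; a minimal sketch, assuming sqoop is on the PATH and the placeholder IP address, port, and credentials are replaced with real values, might look like this:

```shell
#!/bin/bash
# sqoopExportCommand.sh - wraps the export command so it can be re-run easily.
# The IP address (aaa.bbb.ccc.ddd), port (pppp), login, HDFS directory, and
# table name below are the placeholders used in this tip.
sqoop export \
  --connect 'jdbc:sqlserver://aaa.bbb.ccc.ddd:pppp;databasename=AdventureWorks2014' \
  --username 'sqoopLogin' -P \
  --export-dir 'mssqlTips' \
  --table 'tblProductData' \
  -- \
  --schema dbo
```

Keeping the command in a script also makes it easy to version the argument list as the job evolves.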

The output from the sqoop command is shown below.

[hdfs@localhost:/mssqltips]$ ./sqoopExportCommand.sh
16/09/19 12:03:22 INFO sqoop.Sqoop: Running Sqoop version: 1.4.5-cdh5.2.0
Enter password: 
16/09/19 12:03:25 INFO manager.SqlManager: Using default fetchSize of 1000
16/09/19 12:03:25 INFO manager.SQLServerManager: We will use schema dbo
16/09/19 12:03:25 INFO tool.CodeGenTool: Beginning code generation
16/09/19 12:03:26 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM [dbo].[tblProductData] AS t WHERE 1=0
16/09/19 12:03:26 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/lib/hadoop-0.20-mapreduce
Note: /tmp/sqoop-training/compile/3eaf9ad50f5ebf5ef893fba0e41859b1/tblProductData.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
16/09/19 12:03:29 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-training/compile/3eaf9ad50f5ebf5ef893fba0e41859b1/tblProductData.jar
16/09/19 12:03:29 INFO mapreduce.ExportJobBase: Beginning export of tblProductData
16/09/19 12:03:30 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
16/09/19 12:03:32 INFO input.FileInputFormat: Total input paths to process : 1
16/09/19 12:03:32 INFO input.FileInputFormat: Total input paths to process : 1
16/09/19 12:03:33 INFO mapred.JobClient: Running job: job_201609071028_0005
16/09/19 12:03:34 INFO mapred.JobClient:  map 0% reduce 0%
16/09/19 12:03:48 INFO mapred.JobClient:  map 25% reduce 0%
16/09/19 12:03:49 INFO mapred.JobClient:  map 50% reduce 0%
16/09/19 12:04:00 INFO mapred.JobClient:  map 100% reduce 0%
16/09/19 12:04:02 INFO mapred.JobClient: Job complete: job_201609071028_0005
16/09/19 12:04:02 INFO mapred.JobClient: Counters: 24
16/09/19 12:04:02 INFO mapred.JobClient:   File System Counters
16/09/19 12:04:02 INFO mapred.JobClient:     FILE: Number of bytes read=0
16/09/19 12:04:02 INFO mapred.JobClient:     FILE: Number of bytes written=1185776
16/09/19 12:04:02 INFO mapred.JobClient:     FILE: Number of read operations=0
16/09/19 12:04:02 INFO mapred.JobClient:     FILE: Number of large read operations=0
16/09/19 12:04:02 INFO mapred.JobClient:     FILE: Number of write operations=0
16/09/19 12:04:02 INFO mapred.JobClient:     HDFS: Number of bytes read=1778
16/09/19 12:04:02 INFO mapred.JobClient:     HDFS: Number of bytes written=0
16/09/19 12:04:02 INFO mapred.JobClient:     HDFS: Number of read operations=16
16/09/19 12:04:02 INFO mapred.JobClient:     HDFS: Number of large read operations=0
16/09/19 12:04:02 INFO mapred.JobClient:     HDFS: Number of write operations=0
16/09/19 12:04:02 INFO mapred.JobClient:   Job Counters 
16/09/19 12:04:02 INFO mapred.JobClient:     Launched map tasks=4
16/09/19 12:04:02 INFO mapred.JobClient:     Data-local map tasks=4
16/09/19 12:04:02 INFO mapred.JobClient:     Total time spent by all maps in occupied slots (ms)=47822
16/09/19 12:04:02 INFO mapred.JobClient:     Total time spent by all reduces in occupied slots (ms)=0
16/09/19 12:04:02 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
16/09/19 12:04:02 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
16/09/19 12:04:02 INFO mapred.JobClient:   Map-Reduce Framework
16/09/19 12:04:02 INFO mapred.JobClient:     Map input records=26
16/09/19 12:04:02 INFO mapred.JobClient:     Map output records=26
16/09/19 12:04:02 INFO mapred.JobClient:     Input split bytes=576
16/09/19 12:04:02 INFO mapred.JobClient:     Spilled Records=0
16/09/19 12:04:02 INFO mapred.JobClient:     CPU time spent (ms)=4580
16/09/19 12:04:02 INFO mapred.JobClient:     Physical memory (bytes) snapshot=452489216
16/09/19 12:04:02 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=2948849664
16/09/19 12:04:02 INFO mapred.JobClient:     Total committed heap usage (bytes)=127401984
16/09/19 12:04:02 INFO mapreduce.ExportJobBase: Transferred 1.7363 KB in 31.5282 seconds (56.394 bytes/sec)
16/09/19 12:04:02 INFO mapreduce.ExportJobBase: Exported 26 records.
[hdfs@localhost:/mssqltips]$ 

Notice the last line of output above shows that 26 records were exported. This corresponds to the 26 records in the file on the HDFS. When we query the count of rows in our database table, we see that 26 rows are present as displayed in the image below. Also in the image below, we see the first 17 records returned by the query.
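The row-count check and the sample-data query can be run with T-SQL along these lines (a sketch; the table name comes from this tip, and TOP 17 simply matches the 17 rows visible in the image):

```sql
-- Verify that all 26 rows arrived from the HDFS file.
SELECT COUNT(*) AS RowCnt
FROM dbo.tblProductData;

-- View the first 17 rows, as shown in the image.
SELECT TOP 17 ProductKey, ProductName, UnitCost
FROM dbo.tblProductData
ORDER BY ProductKey;
```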

SQL Server table row count and sample data
Next Steps

Please experiment with different data types. Also, please check out these other tips and tutorials on big data and Sqoop on MSSQLTips.com.









About the author
Dr. Dallas Snider is an Assistant Professor in the Computer Science Department at the University of West Florida and has 18+ years of SQL experience.


Monday, December 05, 2016 - 6:00:41 PM - bass_player

PolyBase in SQL Server 2016 can be used to pull data from HDFS/Hadoop into SQL Server. This can be accomplished with a simple SELECT * INTO (SQL Server table) FROM (external table).
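Assuming an external table (here hypothetically named dbo.extProductData) has already been defined over the HDFS file with PolyBase, the commenter's approach might be sketched as follows. The external data source and file format setup is not shown, and the table names are illustrative only:

```sql
-- Assumes dbo.extProductData is a PolyBase external table defined over
-- the HDFS file; SELECT ... INTO creates and loads the new local table.
SELECT *
INTO dbo.tblProductDataFromPolyBase
FROM dbo.extProductData;
```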


Thursday, October 13, 2016 - 11:05:54 AM - Marc Jellinek

What other ways are there to pull data from HDFS into a SQL Server table?

Could Polybase play a role (now that it ships with all versions of SQL Server 2016)?

How could this be done using SSIS?  Or Azure Data Factory?

