Using Sqoop WHERE Argument to Filter Data from a SQL Server


By: Dallas Snider

Problem

I need to load a specific subset of data from a SQL Server table into the Hadoop Distributed File System (HDFS). Can I use a WHERE clause to restrict the rows that are loaded?

Solution

Yes, you can. Apache Sqoop's import tool has a --where argument that allows you to pass in a WHERE clause to restrict the rows returned. In this example, we will load the records for the female customers from the DimCustomer table in the SQL Server 2014 AdventureWorksDW2014 database. In the previous tips in this series, we did not restrict the rows returned. As with the previous tips, this tip was written using SQL Server 2014 and a Cloudera virtual machine.

The Sqoop command we will examine is listed below, and we will break down each argument in the following bullet points. Please note that the command must be entered on one complete line, or split across multiple lines with a backslash (the Linux command-line continuation character) at the end of each line except the last.

sqoop import --connect 'jdbc:sqlserver://aaa.bbb.ccc.ddd:pppp;databasename=AdventureWorksDW2014' \
             --username 'sqoopLogin' -P \
             --table DimCustomer \
             --where "Gender='F'"
  • sqoop import - The executable is named sqoop, and we are instructing it to import data from a database table or view to HDFS.
  • --connect - With the --connect argument, we pass in the JDBC connection string for SQL Server. In this case, it contains the IP address, port number, and database name.
  • --username - In this example, the user name is a SQL Server login, not a Windows login, because our database server is set up to authenticate in mixed mode. We have a server login named sqoopLogin mapped to a database user named sqoopUser, which is a member of the db_datareader role and has a default schema of dbo.
  • -P - This argument prompts the command line user for the password. If Sqoop is executed rarely, this might be a good option. There are multiple other ways to pass the password to the command automatically (one alternative is noted in the script sketch below), but we are trying to keep it simple for this tip.
  • --table - This argument passes in the name of the table to import.
  • --where - This argument contains our WHERE clause. Notice how the clause is enclosed in double quotes so that the single quotes around the string literal 'F' reach SQL Server intact.
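
In the output shown later in this tip, the command is executed from a shell script named sqoopCommand.sh. The contents of that script are not shown here, but it presumably does nothing more than wrap the import command above; this is a minimal sketch under that assumption:

#!/bin/bash
# Assumed contents of sqoopCommand.sh: a simple wrapper around the import
# command above. The IP address, port, and login are the same placeholders.
# For unattended runs, -P could be swapped for Sqoop's --password-file option.
sqoop import --connect 'jdbc:sqlserver://aaa.bbb.ccc.ddd:pppp;databasename=AdventureWorksDW2014' \
             --username 'sqoopLogin' -P \
             --table DimCustomer \
             --where "Gender='F'"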

The output from the sqoop command is shown below.

[hdfs@localhost ~]$ ./sqoopCommand.sh 
16/08/14 19:46:03 INFO sqoop.Sqoop: Running Sqoop version: 1.4.5-cdh5.2.0
Enter password: 
16/08/14 19:46:29 INFO manager.SqlManager: Using default fetchSize of 1000
16/08/14 19:46:29 INFO tool.CodeGenTool: Beginning code generation
16/08/14 19:46:29 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM [DimCustomer] AS t WHERE 1=0
16/08/14 19:46:29 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/lib/hadoop-0.20-mapreduce
Note: /tmp/sqoop-training/compile/1593548bf50d84bbd5bbb384a896cef4/DimCustomer.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
16/08/14 19:46:32 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-training/compile/1593548bf50d84bbd5bbb384a896cef4/DimCustomer.jar
16/08/14 19:46:32 INFO mapreduce.ImportJobBase: Beginning import of DimCustomer
16/08/14 19:46:34 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
16/08/14 19:46:35 INFO db.DBInputFormat: Using read commited transaction isolation
16/08/14 19:46:35 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN([CustomerKey]), MAX([CustomerKey]) FROM [DimCustomer] WHERE ( Gender='F' )
16/08/14 19:46:36 INFO mapred.JobClient: Running job: job_201607151351_0007
16/08/14 19:46:37 INFO mapred.JobClient:  map 0% reduce 0%
16/08/14 19:46:58 INFO mapred.JobClient:  map 50% reduce 0%
16/08/14 19:47:12 INFO mapred.JobClient:  map 75% reduce 0%
16/08/14 19:47:13 INFO mapred.JobClient:  map 100% reduce 0%
16/08/14 19:47:16 INFO mapred.JobClient: Job complete: job_201607151351_0007
16/08/14 19:47:16 INFO mapred.JobClient: Counters: 23
16/08/14 19:47:16 INFO mapred.JobClient:   File System Counters
16/08/14 19:47:16 INFO mapred.JobClient:     FILE: Number of bytes read=0
16/08/14 19:47:16 INFO mapred.JobClient:     FILE: Number of bytes written=1189344
16/08/14 19:47:16 INFO mapred.JobClient:     FILE: Number of read operations=0
16/08/14 19:47:16 INFO mapred.JobClient:     FILE: Number of large read operations=0
16/08/14 19:47:16 INFO mapred.JobClient:     FILE: Number of write operations=0
16/08/14 19:47:16 INFO mapred.JobClient:     HDFS: Number of bytes read=497
16/08/14 19:47:16 INFO mapred.JobClient:     HDFS: Number of bytes written=2371785
16/08/14 19:47:16 INFO mapred.JobClient:     HDFS: Number of read operations=13
16/08/14 19:47:16 INFO mapred.JobClient:     HDFS: Number of large read operations=0
16/08/14 19:47:16 INFO mapred.JobClient:     HDFS: Number of write operations=4
16/08/14 19:47:16 INFO mapred.JobClient:   Job Counters 
16/08/14 19:47:16 INFO mapred.JobClient:     Launched map tasks=4
16/08/14 19:47:16 INFO mapred.JobClient:     Total time spent by all maps in occupied slots (ms)=65017
16/08/14 19:47:16 INFO mapred.JobClient:     Total time spent by all reduces in occupied slots (ms)=0
16/08/14 19:47:16 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
16/08/14 19:47:16 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
16/08/14 19:47:16 INFO mapred.JobClient:   Map-Reduce Framework
16/08/14 19:47:16 INFO mapred.JobClient:     Map input records=9133
16/08/14 19:47:16 INFO mapred.JobClient:     Map output records=9133
16/08/14 19:47:16 INFO mapred.JobClient:     Input split bytes=497
16/08/14 19:47:16 INFO mapred.JobClient:     Spilled Records=0
16/08/14 19:47:16 INFO mapred.JobClient:     CPU time spent (ms)=9430
16/08/14 19:47:16 INFO mapred.JobClient:     Physical memory (bytes) snapshot=441229312
16/08/14 19:47:16 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=2954670080
16/08/14 19:47:16 INFO mapred.JobClient:     Total committed heap usage (bytes)=127533056
16/08/14 19:47:16 INFO mapreduce.ImportJobBase: Transferred 2.2619 MB in 43.6794 seconds (53.0272 KB/sec)
16/08/14 19:47:16 INFO mapreduce.ImportJobBase: Retrieved 9133 records.
[hdfs@localhost ~]$ 

Notice the last line of output above shows that 9,133 records were retrieved, which matches the row count returned by the T-SQL query in the image below. The image also shows the first 10 records in the DimCustomer table. Also notice the BoundingValsQuery line in the output, which confirms that our WHERE clause ( Gender='F' ) was applied to the import.

[Image: table row count and top 10 rows from the sqoop import]
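
If you would rather cross-check the row count from the command line than from SQL Server Management Studio, Sqoop's eval tool can run a query against the source database and print the result. This is a sketch using the same placeholder connect string and login as the import command; it should report 9133:

# Run the equivalent COUNT(*) with our WHERE clause against SQL Server.
sqoop eval --connect 'jdbc:sqlserver://aaa.bbb.ccc.ddd:pppp;databasename=AdventureWorksDW2014' \
           --username 'sqoopLogin' -P \
           --query "SELECT COUNT(*) FROM dbo.DimCustomer WHERE Gender='F'"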

After executing the sqoop command, we can execute the hdfs dfs -ls command below to see the directory that Sqoop created on HDFS, which by default is named after the table.

[hdfs@localhost ~]$ hdfs dfs -ls
Found 1 items
drwxrwxrwx   - training supergroup          0 2016-08-14 19:47 DimCustomer
[hdfs@localhost ~]$ 

We can use the hdfs dfs -ls command below to list the contents of the DimCustomer directory. Notice how four partition files were created by default; the sketch after the listing shows how to control this number.

[hdfs@localhost ~]$ hdfs dfs -ls DimCustomer
Found 6 items
-rw-rw-rw-   1 training supergroup          0 2016-08-14 19:47 DimCustomer/_SUCCESS
drwxrwxrwx   - training supergroup          0 2016-08-14 19:46 DimCustomer/_logs
-rw-rw-rw-   1 training supergroup     601722 2016-08-14 19:46 DimCustomer/part-m-00000
-rw-rw-rw-   1 training supergroup     592323 2016-08-14 19:46 DimCustomer/part-m-00001
-rw-rw-rw-   1 training supergroup     593487 2016-08-14 19:47 DimCustomer/part-m-00002
-rw-rw-rw-   1 training supergroup     584253 2016-08-14 19:47 DimCustomer/part-m-00003
[hdfs@localhost ~]$
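
The four part files correspond to the "Launched map tasks=4" counter in the job output above; four map tasks is Sqoop's default. If a different number of output files is desired, the -m (or --num-mappers) argument controls the number of map tasks. For example, this sketch of the same import would produce a single part file:

# Same import as before, but restricted to one map task so that
# only one part file (part-m-00000) is written to HDFS.
sqoop import --connect 'jdbc:sqlserver://aaa.bbb.ccc.ddd:pppp;databasename=AdventureWorksDW2014' \
             --username 'sqoopLogin' -P \
             --table DimCustomer \
             --where "Gender='F'" \
             -m 1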

The hdfs dfs -cat command below, piped through head, will display the first 10 records in the first partition on HDFS. We can use this command to verify the correct records were loaded; notice the F in the Gender column of every row.

[hdfs@localhost ~]$ hdfs dfs -cat DimCustomer/part-m-00000|head
11003,11,AW00011003,null,Christy,null,Zhu,false,1973-08-14,S,null,F,christy12@adventure-works.com,70000.0000,0,0,Bachelors,Licenciatura,Bac + 4,Professional,Profesional,Cadre,0,1,1825 Village Pl.,null,1 (11) 500 555-0162,2010-12-29,5-10 Miles
11004,19,AW00011004,null,Elizabeth,null,Johnson,false,1979-08-05,S,null,F,elizabeth5@adventure-works.com,80000.0000,5,5,Bachelors,Licenciatura,Bac + 4,Professional,Profesional,Cadre,1,4,7553 Harness Circle,null,1 (11) 500 555-0131,2011-01-23,1-2 Miles
11006,8,AW00011006,null,Janet,G,Alvarez,false,1976-12-02,S,null,F,janet9@adventure-works.com,70000.0000,0,0,Bachelors,Licenciatura,Bac + 4,Professional,Profesional,Cadre,1,1,2612 Berry Dr,null,1 (11) 500 555-0184,2011-01-24,5-10 Miles
11008,32,AW00011008,null,Rob,null,Verhoff,false,1975-07-04,S,null,F,rob4@adventure-works.com,60000.0000,4,4,Bachelors,Licenciatura,Bac + 4,Professional,Profesional,Cadre,1,3,624 Peabody Road,null,1 (11) 500 555-0164,2011-01-25,10+ Miles
11010,22,AW00011010,null,Jacquelyn,C,Suarez,false,1969-08-05,S,null,F,jacquelyn20@adventure-works.com,70000.0000,0,0,Bachelors,Licenciatura,Bac + 4,Professional,Profesional,Cadre,0,1,7800 Corrinne Court,null,1 (11) 500 555-0169,2011-01-14,5-10 Miles
11012,611,AW00011012,null,Lauren,M,Walker,false,1979-01-14,M,null,F,lauren41@adventure-works.com,100000.0000,2,0,Bachelors,Licenciatura,Bac + 4,Management,Gestión,Direction,1,2,4785 Scott Street,null,717-555-0164,2013-03-16,1-2 Miles
11014,634,AW00011014,null,Sydney,null,Bennett,false,1973-11-06,S,null,F,sydney23@adventure-works.com,100000.0000,3,0,Bachelors,Licenciatura,Bac + 4,Management,Gestión,Direction,0,3,9011 Tank Drive,null,431-555-0156,2013-03-23,1-2 Miles
11015,301,AW00011015,null,Chloe,null,Young,false,1984-08-26,S,null,F,chloe23@adventure-works.com,30000.0000,0,0,Partial College,Estudios universitarios (en curso),Baccalauréat,Skilled Manual,Obrero especializado,Technicien,0,1,244 Willow Pass Road,null,208-555-0142,2013-01-18,5-10 Miles
11017,39,AW00011017,null,Shannon,null,Wang,false,1949-12-24,S,null,F,shannon1@adventure-works.com,20000.0000,4,0,High School,Educación secundaria,Bac + 2,Skilled Manual,Obrero especializado,Technicien,1,2,7330 Saddlehill Lane,null,1 (11) 500 555-0195,2011-01-12,5-10 Miles
11021,536,AW00011021,null,Destiny,null,Wilson,false,1984-03-02,S,null,F,destiny7@adventure-works.com,40000.0000,0,0,Partial College,Estudios universitarios (en curso),Baccalauréat,Skilled Manual,Obrero especializado,Technicien,0,1,8148 W. Lake Dr.,null,622-555-0158,2013-01-23,1-2 Miles
[hdfs@localhost ~]$
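
Finally, because Sqoop writes one record per line in its default text output, we can confirm that the partitions together contain all of the filtered rows by counting the lines across every part file. Given the import output above, this should print 9133:

# Count the records across all four part files.
hdfs dfs -cat DimCustomer/part-m-* | wc -l
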

About the author
Dr. Dallas Snider is an Assistant Professor in the Computer Science Department at the University of West Florida and has 18+ years of SQL experience.