ETL Interview Questions for SSIS and ADF - Part 2
In this second part of the tip, we continue with interview questions related to ETL; this time with a focus on SSIS and ADF.
The questions featured in this tip are questions the author would ask if he was the interviewer, but some questions also come from actual interviews. If you know of other interesting questions, feel free to add them in the comments. You can read part 1 here.
What is your experience with SSIS?
With questions like these, the interviewer wants to find out how well you know the product. Depending on the position, they're looking for someone with basic, intermediate, or expert knowledge. This should be indicated in the job description. Explain how long you've been working with SSIS, what kinds of projects you've done, and which successes you've had. For example, how you loaded gigabytes of data efficiently into a database or that you used SSIS to load a complex data warehouse. If you don't have experience with SSIS, explain which trainings you've taken or which certifications you've done or are planning to do. You can tell the interviewer how you would remediate your lack of experience, for example by following some online tutorials.
The interviewer might ask some follow-up questions, like how you would tackle a specific problem in SSIS or what a certain task is used for. If you don't know the answer, that's OK. Explain you haven't had the chance to work on that particular area yet but tell how you would research the answer to that question. Whatever you do, don't lie or exaggerate your experience.
The following tip has some interview questions specifically for SSIS: SQL Server Developer Integration Services SSIS Interview Questions. It's quite old, but many concepts in SSIS are still the same.
What do you think are some shortcomings of SSIS?
This is an open-ended question. Such questions allow for a bit of debate and discussion but more importantly, they can easily tell the interviewer how much experience someone has with SSIS. Someone who has a lot of experience with SSIS can easily sum up a few items, while someone who has barely opened the product will struggle with this question.
Here are a few shortcomings:
- Metadata in the data flow is fixed. This means if you need to load 10 Excel files and they all have a different structure, you'll need 10 data flows. Schema drift can make a package crash if you're not careful.
- There are many different sets of data types, making it confusing for the developer. For example, you have data types in the data flow (DT_NUMERIC, DT_WSTR, DT_DATE …), data types for parameters in the Execute SQL Task, and data types for the SSIS parameters/variables.
- It's not easy to generate SSIS packages. You need to rely on 3rd party solutions like Biml.
- Transactions and checkpointing are not that easy to implement.
- Some tasks/components are hard to configure, like the UNPIVOT command.
- When working with source control, it's hard to visually compare the changes between different versions of the package because of the complex XML that makes up an SSIS package.
- SSIS packages are developed within Visual Studio. The designer can be a bit buggy and Visual Studio can crash from time to time. This also means people using Linux or Mac can't develop SSIS packages.
- The defaults of the SSIS catalog are not so good when it comes to log retention. Because it keeps the logs around for a long time and the default recovery model is full, the SSISDB can become huge in size after some time. Check out the tips SQL Server Integration Services Catalog Best Practices and Managing the size of the SQL Server SSIS catalog database on how to manage this.
- There doesn't seem to be a lot of active development for SSIS from Microsoft, making the future of the product uncertain.
What are the strengths of SSIS?
Again, an open-ended question with the same purpose as the previous question.
- SSIS is a mature product. It has been around for over 15 years. There are many learning resources available on the web (a lot for free) and there are many developers out there who know how to work with SSIS.
- SSIS is quite fast. It used to hold the world speed record for ETL for a long time. When the data flow has been developed correctly – with no blocking transformations – it can easily handle millions of rows. If the source and destination can handle it, SSIS can "stream" data between the two, while transforming data on-the-fly.
- If some functionality is missing outside the box, you can easily extend a package by using external processes (like zipping a file), or by using .NET in a script task or script component. You can even build your own custom tasks if you want to. There's also a whole ecosystem of custom task providers to extend SSIS.
- If you want, you can migrate your SSIS projects as-is to ADF using the Azure-SSIS IR. Check out the webinar Migrating SQL Server Integration Services to the Cloud on how to do this.
- If you already have a SQL Server license, SSIS is free if you install it on the same machine.
What is your experience with ADF?
This is basically the same question as the one about SSIS. ADF hasn't been around as long as ADF, so it's a bit more difficult to already have years of experience in it. Just lay out your projects, training, tutorials etc., everything that contributes to you building up experience. And explain how you would tackle any knowledge gaps.
What do you think are some shortcomings of ADF?
With questions like these, don't just bash on the product alone. It might show you have experience working with the product, but the interviewer might wonder if you would actually like working on the project. You can detail some shortcomings, but at the same time, you should try to detail a work around. This will show that you know the product and that you can think constructively.
- ADF can be costly. Since it's a cloud offering, everything costs money. Especially the Copy data activity can be expensive if you execute it in parallel in a For Each loop since each execution is billed for at least one minute. There are ways to mitigate costs, such as lowering the number of DIU used, replacing the Copy data activity with a Logic App or Azure Function (external activities are much cheaper), or by executing compute on a self-hosted IR.
- There's functionality missing. For example, you cannot send an e-mail or execute a PowerShell script on an on-premises machine. Most missing functionality can be implemented elsewhere, such as in Azure Functions, Azure Logic Apps, Azure Databricks, or even Azure Automation. If all else fails, you can still run SSIS packages in an Azure-SSIS IR. You can find an example of sending an e-mail through Logic Apps in the tips Azure Data Factory Pipeline Email Notification – Part 1 and Part 2.
- It's not as mature yet as SSIS. Some features like using OR logic in precedence constraints don't exist in ADF. It's also not possible to create expressions on them, like in SSIS (see the tip Using Expressions On Precedence Constraints in Integration Services for more info). You cannot execute a single activity – which would be useful for debugging – or disable certain activities while debugging. You cannot nest multiple loops inside each other, you have to create separate pipelines for those.
What are the strengths of ADF?
Most of the strengths of ADF resolve weaknesses found in SSIS.
- ADF has very good integration with git. Because of the JSON code behind the pipelines, it's easier to visually compare different versions. Setting up a CI/CD pipeline is more straightforward than with SSIS.
- Unlike SSIS, metadata isn't necessarily fixed. You can define a column mapping, but you don't need to. You can leave the mapping blank and ADF will map columns automatically in the Copy Data activity.
- TThere's only one set of data types in ADF, but the number of available data types is limited though.
- Since it's JSON behind the scenes instead of complex XML, it's easier to generate pipelines. However, in many cases, you don't need to. You can use dynamic metadata mapping to create one single pipeline that transfers data from multiple sources. Check out Building Flexible and Dynamic Azure Data Factory Pipelines for more info.
- ADF is developed inside Azure Data Factory Studio, which is in the browser. This means anyone with an Internet connection and the required permissions can develop for ADF, no matter their operating system of choice.
- ADF is a serverless cloud solution. This means it can scale automatically if needed (e.g. for the Copy Data activity) and that it doesn't require any maintenance. Logging is taken care of automatically and because it's part of Azure, you can hook the logging into Azure Monitor and Azure Log Analytics.
- Unlike SSIS, there's still very active development going on for ADF.
When would you use data flows in ADF?
DData flows offer a visual designer to create ETL pipelines for big data. Behind the scenes, there's a Spark cluster running, and the visual pipeline gets translated to code. It's probably cheaper to write the code yourself in Azure Databricks or any other Spark cluster, but if you want to deal with big data and don' want/can/like to write code, Data Flows are a very good option.
SSIS or ADF, what would you choose?
As usual, it depends.
- For on-premises ETL, SSIS is still a very good choice, especially if you have a SQL Server license.
- For new projects in the cloud, ADF is in most cases the way to go.
- For cloud projects, you can opt for SSIS if you already have much knowledge in SSIS and minimal knowledge of ADF. You can always use the Azure-SSIS IR in ADF to run your SSIS packages and migrate later.
Keep in mind the strengths and weaknesses of both tools which were discussed in previous questions. Check out the tip Choosing Between SQL Server Integration Services and Azure Data Factory for more detail.
Do you have experience with other tools in the Microsoft Data Platform?
IIf you're working in Azure, chances are you're not just going to work with Azure Data Factory alone. For certain use cases, you might need complementary features from other products like Azure Functions, Azure Logic Apps, or Azure Databricks. Azure Synapse Serverless SQL Pools or Spark Pools are also interesting to get in your toolbelt. As usual with these types of questions, answer honestly and explain how you would resolve any knowledge gaps.
- Do you know more ETL-related interview questions? Feel free to post them in the comments!
- Stay tuned for more tips with ETL interview questions!
- For a comparison of tools, check out Azure Data Factory vs SSIS vs Azure Databricks. Also check out this overview of ETL tools in the Microsoft Data Platform: part 1 and part 2.
- There are even more tips with SSIS interview questions:
About the author
View all my tips
Article Last Updated: 2022-08-01