ETL Interview Questions for the Microsoft Data Platform on Performance, Troubleshooting, and more - Part 3
We're continuing the series on interview questions related to ETL. If you haven't already, check out part 1 and part 2. This part focuses on interview questions geared toward data warehousing and troubleshooting.
The questions featured in this tip are questions the author would ask if he were the interviewer, but some questions also come from actual interviews. If you know other interesting questions, feel free to add them in the comments.
What are slowly changing dimensions?
Slowly changing dimensions (SCD) are a design technique for dimension tables to track history. There are different types of dimensions:
- Type 1: no history is kept. If properties change for a certain dimension member, they are updated.
- Type 2: history is kept by inserting new rows. If properties change for a dimension member, instead of updating the fields, a new row is inserted into the dimension. To keep track of the chronological order, valid from and valid to date fields are added to the dimension.
- Type 3: only the previous value is kept. This is typically used in migration scenarios. An example is where you have a column current email and a column previous email.
There are other types of slowly changing dimensions, but these are combinations of the basic three types. To learn more about how you can implement this in SSIS, check out the following tips:
- Implement a Slowly Changing Type 2 Dimension in SQL Server Integration Services - Part 1
- Optimizing Type 2 Slowly Changing Dimensions in Integration Services - Part 2
- Additional Optimizations for Type 2 Slowly Changing Dimensions in SQL Server Integration Services - Part 3
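As an illustration, a Type 2 change can be handled with an update-then-insert pattern. The following sketch assumes a hypothetical `dbo.DimCustomer` dimension and an `stg.Customer` staging table; all table and column names are assumptions:

```sql
-- Sketch of a Type 2 change for a hypothetical DimCustomer table.
-- Assumes stg.Customer holds the latest source values.

-- 1. Close off the current row when a tracked attribute changed.
UPDATE d
SET    d.ValidTo = SYSDATETIME()
FROM   dbo.DimCustomer AS d
JOIN   stg.Customer    AS s ON s.CustomerID = d.CustomerID
WHERE  d.ValidTo = '9999-12-31'          -- marks the current row
  AND  (d.City <> s.City OR d.Email <> s.Email);

-- 2. Insert a new current row for changed and brand-new members.
INSERT INTO dbo.DimCustomer (CustomerID, City, Email, ValidFrom, ValidTo)
SELECT s.CustomerID, s.City, s.Email, SYSDATETIME(), '9999-12-31'
FROM   stg.Customer AS s
WHERE  NOT EXISTS (SELECT 1
                   FROM   dbo.DimCustomer AS d
                   WHERE  d.CustomerID = s.CustomerID
                     AND  d.ValidTo = '9999-12-31');
```

After step 1, a changed member no longer has a current row, so step 2 picks it up along with members that are entirely new to the dimension.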
A query is running slow in the data warehouse. How do you investigate?
This is an open-ended question to test how you tackle more complex problems. When you explain your methods, it allows the interviewer to see how you think, research and approach issues.
One advantage of a data warehouse or ETL processes is that there aren't many ad-hoc queries (unless you have DirectQuery from Power BI on your data warehouse). This makes it a bit more straightforward for performance tuning, as the set of possible queries is limited. A troubleshooting process could look like this:
- First, you have to identify the query that is running slow. In a lot of cases, since someone is complaining, you already know which one. Maybe a view on top of a fact table is slow, or a report is not rendering. In other cases, you can use Extended Events, Profiler or Query Store to look at slow queries.
- Once you have the query, you can execute it in SSMS to look at the actual query plan. If the query never finishes, you can use the estimated query plan instead.
In the plan, search for the following:
- Big differences in actual vs estimated rows. Maybe statistics are out-of-date? Or the query is using user-defined functions so that accurate estimates cannot be made?
- Are there bookmarks or key lookups? This is a sign the indexes used are not covering. A solution is to modify the index to add the missing column to the index or add the column as an included column.
- Are there exclamation marks? This is a warning from SQL Server. There might be implicit conversions, not enough memory, spilling to disk, etc.
- There might be other things going on, such as parameter sniffing. This can cause, for example, a nested loops join to be used instead of a merge join (or the other way around).
- Once you know (or suspect) the cause, you can take action. Maybe some indexes need to be created or adjusted, or the query needs to be rewritten. In more severe cases, some remodeling needs to be done.
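If Query Store is enabled on the data warehouse, the first step (identifying the slow query) can be done with a query like the following sketch; the `TOP (10)` cutoff is just illustrative:

```sql
-- Top 10 query/plan combinations by average duration in Query Store.
-- Query Store stores durations in microseconds, so divide for milliseconds.
SELECT TOP (10)
       qt.query_sql_text,
       rs.avg_duration / 1000.0 AS avg_duration_ms,
       rs.count_executions,
       rs.avg_logical_io_reads
FROM   sys.query_store_runtime_stats AS rs
JOIN   sys.query_store_plan          AS p  ON p.plan_id        = rs.plan_id
JOIN   sys.query_store_query         AS q  ON q.query_id       = p.query_id
JOIN   sys.query_store_query_text    AS qt ON qt.query_text_id = q.query_text_id
ORDER BY rs.avg_duration DESC;
```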
How can you improve load performance?
Like the previous question, there are a lot of possible answers to this question, and it's designed to show your expertise and how you tackle problems. Here are a few options:
- Disable indexes and/or foreign keys if you're loading into a database table.
- Try to load in parallel if possible.
- If your source is many small files (for example parquet files in a data lake), you can combine them into bigger files first to improve performance.
- If possible, change the SQL Server recovery model to simple. This model has much less logging overhead than the full recovery model, and in data warehouse scenarios it usually makes sense to use simple. Keep in mind that Azure SQL Database supports only the full recovery model. The Data Loading Performance Guide is a good resource to check whether you're actually using minimally logged operations.
- Try to insert in big batches instead of row-by-row, but make sure the transaction log can handle it; if one big batch is too big, split it into smaller batches.
- If the destination table has a clustered columnstore index, try to insert only in batches large enough to fill a full rowgroup (1,048,576 rows), so rows are compressed directly instead of landing in the delta store.
- You can speed up inserts by using partition switching.
- In SSIS, use the fast load option when inserting in a table. There are other tools that can bulk insert as well, like bcp and the BULK INSERT command.
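Several of these options can be combined in one load pattern. The following sketch assumes a hypothetical `dbo.FactSales` table with a nonclustered index and a CSV file as the source; names and paths are assumptions:

```sql
-- Disable nonclustered indexes before the load to avoid maintaining them row by row.
ALTER INDEX IX_FactSales_Date ON dbo.FactSales DISABLE;

-- Bulk load in big batches; TABLOCK enables minimal logging
-- under the simple or bulk-logged recovery model.
BULK INSERT dbo.FactSales
FROM 'D:\staging\factsales.csv'
WITH (TABLOCK, BATCHSIZE = 1000000, FIELDTERMINATOR = ',', ROWTERMINATOR = '\n');

-- Rebuild the indexes afterwards; a rebuild also re-enables them.
ALTER INDEX IX_FactSales_Date ON dbo.FactSales REBUILD;
```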
The nightly ETL job has crashed. How do you start your investigation?
The first thing should be to track down the error message. This can be in the logging of the SQL Server Agent job, or in the monitor pane of ADF for example. Depending on the error, there are some actions you can take:
- Is the error transient? For example, was there a network issue that is now gone? Was there a DNS issue? Sometimes the solution is to just restart the job.
- Is the error external to the ETL? In other words, outside the scope of the ETL process. An example could be that the hard drive is full, or that the transaction log has reached its limit. I've also seen permission issues or firewall issues causing an ETL job to fail. In such cases, you might have to work with admins of other teams to get the issue resolved.
- If the error comes from within the ETL process itself, you need to determine what kind of error it is:
- Is it a technical error? Such as data truncation, a primary key violation, or trying to insert a NULL value into a non-nullable column. In such cases, the ETL developer must take corrective action.
- Is it more a business-related error? For example, a bug in the logic causing a sudden "divide by zero error", or incorrect values entered into the source system. In those cases, the ETL developer might have to work with the business users to determine the solution for the error.
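For jobs scheduled through SQL Server Agent, the error message can be retrieved from the msdb history tables, for example:

```sql
-- Most recent failed job steps, with their error messages.
SELECT TOP (20)
       j.name AS job_name,
       h.step_name,
       h.run_date,
       h.run_time,
       h.message
FROM   msdb.dbo.sysjobhistory AS h
JOIN   msdb.dbo.sysjobs       AS j ON j.job_id = h.job_id
WHERE  h.run_status = 0            -- 0 = failed
ORDER BY h.instance_id DESC;
```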
Business users are complaining the data is not correct in the reports. They suspect it's the data warehouse. How do you investigate?
The first thing one must always do is ask for more information. Often you just get a screenshot of the report with the comment that it is not correct. Try to find out exactly which report was used (in the case of Power BI, which tab we're looking at), on which environment (QA, production) and, especially, which parameters or filters were used to render the report. Then ask what values they were expecting.
Then you can either start a bottom-up approach or a top-down approach. In the bottom-up approach, you look directly at the data in the source systems and try to figure out if it is correct. Sometimes there's simply an error or incorrect data in the source system. If the data seems correct, work your way up (staging area, data warehouse, semantic model, and finally the report) until you find the place where the data no longer matches the expected results. This investigative process usually takes some time and a bit of SQL querying.
In the top-down approach, you take the opposite direction. You start from the report (do you get the same incorrect values when you run the report?) and work your way down. When you write a query directly on the facts and the dimensions, is the data still correct? And so on.
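Both approaches usually boil down to reconciling aggregates between layers. A hypothetical check comparing a daily total in the staging area against the fact table could look like this (all names are assumptions):

```sql
-- Days where the staging total and the warehouse total disagree.
SELECT COALESCE(s.OrderDate, f.OrderDate) AS OrderDate,
       s.TotalAmount AS staging_amount,
       f.TotalAmount AS dwh_amount
FROM  (SELECT OrderDate, SUM(Amount) AS TotalAmount
       FROM   stg.SalesOrders
       GROUP BY OrderDate) AS s
FULL OUTER JOIN
      (SELECT d.OrderDate, SUM(fs.SalesAmount) AS TotalAmount
       FROM   dbo.FactSales AS fs
       JOIN   dbo.DimDate   AS d ON d.DateKey = fs.OrderDateKey
       GROUP BY d.OrderDate) AS f
  ON f.OrderDate = s.OrderDate
WHERE ISNULL(s.TotalAmount, 0) <> ISNULL(f.TotalAmount, 0)
ORDER BY OrderDate;
```

The full outer join also surfaces days that exist in only one of the two layers, which often points to rows lost or duplicated during the load.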
- Stay tuned for more interview questions!
- Check out the SQL Server Performance Tuning and Monitoring Tutorial and A Day in the Life of a BI Developer
Article Last Updated: 2022-08-10