ETL Interview Questions for T-SQL, Database Design and Data Loads - Part 4

By:   |   Updated: 2022-11-01   |   Comments   |   Related: 1 | 2 | 3 | 4 | More > Professional Development Interview Questions BI


Problem

This tip series has focused on ETL interview questions. This latest tip will highlight more T-SQL-related questions. Other parts of this series can be found here:

Solution

The questions featured in this tip are questions the author would ask if he were the interviewer, while some questions also come from actual interviews. If you know of other interesting questions, feel free to add them in the comments.

Interview Questions

Can you give an example of how you would use window-SQL-related functions in T-SQL to solve a problem?

Window functions are a more advanced topic of SQL functions, making it a good question for the interviewer to see the extent of your SQL knowledge. When creating ETL pipelines and ELT pipelines, you typically need to write a fair amount of SQL. A good ETL developer, therefore, has a good grasp of all SQL concepts.

Examples of window functions include LAG/LEAD, ROW_NUMBER, RANK, SUM with an OVER clause, etc. A couple of use cases:

  • You can use ROW_NUMBER() to determine and remove duplicate rows. An example is given in the tip T-SQL Tips and Tricks.
  • When you combine SUM with an OVER clause, you can calculate sub and grand totals. Using the ROWS clause to limit rows, you can calculate running totals or moving averages.
  • With PERCENTILE_DISC or PERCENTILE_CONT, you can calculate the median of a set of values.

You can find other examples and more detail in the SQL Server T-SQL Window Functions Tutorial.

What are common table expressions?

This is another SQL question to test your understanding of the query language. Common Table Expressions use the WITH clause to assign a name to a subquery. Suppose we have the following query:

SELECT a, b, c, d = a + c
FROM (SELECT a, b, c = a + b FROM myTable) tmp;

With a common table expression (CTE), the query looks like this:

WITH cte_tmp AS
(
    SELECT a, b, c = a + b FROM myTable
)
SELECT a, b, c, d = a + c
FROM cte_tmp;

Even though the query is a bit longer, it's more readable. CTEs are typically used to make long, complex queries more readable and understandable, as explained in the tip How to use SQL Server CTEs to make your T-SQL code readable by humans.

A more advanced use case for CTEs is recursive queries. You can learn all about them in the tip SQL Server Common Table Expressions (CTE) usage and examples.

Why are cursors bad in SQL Server?

Cursors are typically frowned upon because they deal with data on a row-by-row basis. However, SQL is a set-based language, so performance-wise, it's better to deal with data in sets instead of individual rows. If you use cursors on a large data set, you will likely run into performance issues. You can find an example in the tip SQL Server Convert Cursor to Set Based.

A method for imitating looping behavior can be done using a "numbers" table (also called a tally table). You can find examples in these tips:

Cursors aren't always bad, however. Suppose you need to loop over SQL Server objects, such as user databases or tables, then you can use a cursor or a WHILE loop. You can find more information about cursors in the tip SQL Server Cursor Example.

How do you check for data integrity?

Data integrity is typically maintained in a database by implementing primary and foreign key constraints. Sometimes in DWH environments, foreign keys are not created between the surrogate keys of the fact table and the corresponding primary keys in the dimensions. This is done to improve load performance. The general idea is that the ETL itself should be responsible for maintaining the integrity. For example, when loading a fact table, the process should check if a dimensions member is already present (e.g., during a lookup in SSIS). If it isn't, you can insert a dummy surrogate key (-1 is a popular value) or insert an inferred member in the dimension (also known as late arriving dimensions). Check out the tip Handle Early Arriving Facts in SQL Server Integration Services for more information.

Another check is implementing unique constraints on the dimensions. You don't want multiple members with the same business key in a dimension. If you did a lookup from a fact table, this would mean multiple rows can be returned, which leads to duplication of fact data. In the case of a type 2 slowly changing dimension, a lookup should match the business key and filter on the time slices.

Let's illustrate this with an example. In the following script, a lookup is done from the sales fact table to the customer dimension, which is a type 2 dimension:

SELECT *
FROM dbo.FactSales f
JOIN dbo.DimCustomer c ON f.SK_Customer = c.SK_Customer
            AND f.SK_SalesDate BETWEEN c.ValidFrom and c.ValidTo;

What are columnstore indexes?

Columnstore indexes were introduced in SQL Server 2012 as a new index type. Instead of storing data in a row-by-row fashion, as it was always done in the data pages, data is stored in a columnar fashion. This can lead to high compression ratios, saving a lot of IO and memory when dealing with queries. The original columnstore indexes were non-clustered and not updateable. In later versions, a clustered columnstore index was introduced that was updateable.

You can learn more about columnstore indexes in this webinar, Supercharge Your SQL Server Data Warehouse with Columnstore Indexes.

Suppose you needed to load 100 tables from a source. How do you deal with this efficiently?

Ideally, you do not develop 100 different artifacts. In ADF, you can use the auto-mapping capabilities of the Copy Data activity together with parameterization inside a For Each Loop. This will allow you to develop only one pipeline to handle the load of 100 tables. This pattern is described in the ADF tutorial, Building Flexible and Dynamic Azure Data Factory Pipelines.

In SSIS, however, metadata is fixed inside a data flow. You can use the Import/Export wizard of SSMS to generate a single package with 100 data flows. Performance-wise, this might not be the best idea since they will not run in parallel. Another option is to generate 100 packages using the Biml language. Biml can read the metadata of the source system (if it's a relational database) and generate the packages for you. You can learn more about Biml in the tutorial (Introduction to the Biml Language) and this webcast (Introduction to Biml - Generating your SSIS Packages from Scratch).

Next Steps





get scripts

next tip button



About the author
MSSQLTips author Koen Verbeeck Koen Verbeeck is a BI professional, specializing in the Microsoft BI stack with a particular love for SSIS.

View all my tips


Article Last Updated: 2022-11-01

Comments For This Article

















get free sql tips
agree to terms