SnowPro Advanced Architect Certification Preparation Essentials

Problem

Preparation for Snowflake certification exams can be a very complex process, specifically for advanced exams such as the SnowPro Advanced Architect Exam. Even with extensive hands-on experience, dedication and diligent preparation you can achieve a passing score. Preparation, in turn, can include reading extensive materials and documentation to delve deeply into the topics or to review specific details. Moreover, in the case of recertification, which usually happens every two years, it is necessary to repeat this learning and rehearsal process. Thus, having the main Snowflake concepts and topics required for the exam highlighted and documented in a shortened format can be really helpful when preparing for the exam within a short period of time.

Solution

In this article, we summarize some important Snowflake concepts that often appear during the exam, based on my experience. Please note that it is assumed the reader has background knowledge of the Snowflake concepts discussed in this article and is already mostly prepared for the exam; therefore, this article can serve as an exam cram for review. Please also note that we discuss only specific concepts in this article, which, in my view, appear frequently in the Snowflake Architect exam; in the real exam, however, the variety of topics can be much broader.

This is the sixth year since I passed the SnowPro Advanced Architect exam, and therefore, I am taking the re-certification exam for the second time. Wow, six years already… certification renewals are a very good indicator of how time flies. During this process, I noticed that questions about some specific concepts often appear in the exams. I selected these topics and prepared a brief description highlighting some of their key features that, in my opinion, can be beneficial during the exam. Also, I think it would be very handy to have this cram and use it right before the exam to refresh your memory or use it as a kind of guide to follow the main concepts and dive deeper using the provided links, if needed. Please note that the concepts discussed below are in random order and not organized according to Snowflake’s exam topic breakdown.

Let’s begin!

Clustering

Snowflake automatically clusters data in tables. However, over time, the natural clustering may become less optimal, specifically due to DML operations. Therefore, Snowflake supports defining a clustering key on tables to optimize data retrieval based on specific criteria. A clustering key consists of one or more columns in a table (or an expression on a table), and based on the key order, data is reorganized into micro-partitions to enable more efficient micro-partition pruning and query performance based on the key column(s). A table is considered clustered when a clustering key is defined on it. Materialized views can also be clustered.

The clustering depth of a table based on specific columns represents the average depth (1 or greater) of the overlapping micro-partitions for those columns. A smaller depth is an indicator of better clustering.

Some general indicators to determine if clustering may be needed:

Noticeably degraded query performance
Large clustering depth of the table

Benefits of clustering:

Improved query scan efficiency and performance through reduced micro-partition scanning
Improved column compression
Automatic maintenance of clustered tables by Snowflake, helping maintain effective clustering

Criteria for table clustering (all should meet for the best match):

The table is a large, multi-terabyte table
Queries on the table are highly selective and frequently use sorting
Many queries use the same small set of columns for filtering and sorting

Clustering key selection recommendations:

Maximum of 3 – 4 columns (or expressions) for a clustering key
Choosing columns intensively used in selective filters, and if the number does not exceed the recommended maximum, adding columns actively used in JOIN predicates
Columns (or expressions) with optimal cardinality (number of distinct values) to benefit from clustering — not too large and not too small — allowing optimal pruning and grouping of rows into micro-partitions.
If it is necessary to define a clustering key on a column with very high cardinality, it is recommended to define an expression on the column to reduce distinct values and use the expression as a key, assuming that the expression preserves the original ordering of the column values.
For composite clustering keys, it is recommended to order columns from lowest to highest cardinality.

Clustering costs:

Compute costs – when data in a clustered table becomes degraded due to DML operations, Snowflake performs reclustering to maintain effective clustering by reorganizing data according to the clustering key. These processes require computing resources and therefore consume credits.
Storage costs – during reclustering, new micro-partitions are created to physically regroup data based on the clustering key. Meanwhile, the old micro-partitions are marked as deleted, but not physically removed for some period to support Time Travel and Fail-safe. Therefore, this adds storage costs.

Monitoring clustering information:

SYSTEM$CLUSTERING_INFORMATION – This system function can be used to get clustering information, such as:
- Average clustering depth for a table based on one or more columns in the table
- Total number of micro-partitions
- Average number of overlapping micro-partitions for each micro-partition in the table
- Average overlap depth of each micro-partition in the table, etc.
SYSTEM$CLUSTERING_DEPTH – This system function can be used to retrieve the average depth of the table by the specified columns (or the clustering key of that table).

Materialized Views

A materialized view stores pre-computed data physically, which can make querying the materialized view faster when the query can be satisfied by the materialized view definition, compared to querying the base table.

Required Snowflake edition:

Enterprise and higher

Some use cases that can benefit from materialized views:

The number of rows and/or columns returned by the query is relatively small compared to the base table
Queries requiring complex processing (for example, semi-structured data analysis and heavy aggregations)
Queries on external tables
The base table of the view does not change frequently

Benefits of materialized views:

Can improve query performance for frequently used queries
Automatically maintained by Snowflake
Data in materialized views is automatically kept in sync with updates to the base table

Rules of thumb for choosing between materialized views and regular views:

If all these conditions are met, it is recommended to create a materialized view:
- The base table, or at least the rows from the base table used in the materialized view, do not change frequently
- The result of the view is used frequently (much more often than the results change)
- The query is very resource-intensive
If any of these conditions are met, it is recommended to create a regular view:
- The view results change frequently
- The view is not frequently used
- The query is not costly

Costs:

Materialized views consume storage as the query results are physically stored
Materialized views consume compute resources, as Snowflake uses background processes to automatically synchronize the materialized view with the base table and keep it up to date

Some limitations:

There are many limitations to creating materialized views. Below are some selected ones:
- A materialized view can refer to only one base table
- It cannot contain joins, even self-joins
- It cannot query views (materialized and non-materialized), hybrid and dynamic tables, and user-defined table functions
- It cannot contain HAVING, ORDER BY, or LIMIT clauses; user-defined functions; window functions; or MINUS, INTERSECT, and EXCEPT operators, as well as many aggregate functions

Some additional facts and functionalities of materialized views useful for the exam:

A materialized view can be defined as a secure view
Materialized views cannot be directly cloned. However, they can be cloned indirectly by cloning the database or schema containing the materialized view
Historical data of materialized views is stored by Snowflake; however, materialized views cannot be queried at a point in time in the past using Time Travel. That said, Time Travel can be used to clone materialized views at a specific point in time.
If new columns are added to the base table, these columns are not automatically reflected in the materialized view, even if SELECT * is used. Therefore, using SELECT * is not considered best practice when defining materialized views.
Changing or dropping columns of the base table (even columns that are not part of the materialized views) causes all materialized views on that table to be suspended. They cannot be resumed and must be recreated to be used again.
Dropping the base table will cause materialized views on that table to be suspended, but they are not dropped automatically.

Cloning

The zero-copy cloning feature in Snowflake enables the easy copying of databases, schemas, tables, and other schema objects without duplicating storage initially. However, once changes are made to the cloned object, these changes are tracked independently and therefore consume storage. Below, let’s highlight some important characteristics of cloning that can be useful during the exam.

Some important facts about cloning:

Clones can also be cloned, with no limit on the number of iterations.
All child objects of cloned databases and schemas inherit all corresponding privileges grantedfrom the corresponding source objects. However, the parent cloned object (database or schema) does not inherit privileges granted on the source.
When cloning a database or schema, external named stages are cloned. Internal named stages, however, are not automatically cloned. To include internal stages in the clone, the INCLUDE INTERNAL STAGES clause should be used.
When the cloned object contains a pipe reference to an internal stage, that pipe is not cloned. However, pipes referencing external stages are cloned.
Temporary tables can be a source for cloning temporary and transient tables. Transient tables can also serve as a source for creating temporary and transient table clones. Permanent tables can be cloned into permanent, transient, and temporary tables. Thus, for permanent table clones, only permanent tables can serve as a source.
Objects can be cloned using Time Travel to clone from historical data using the AT | BEFORE clause. If the object is missing at the specified time, an error will be generated.

Query profile and query monitoring

The Query Profile provides a visual interface to understand query execution in a detailed way. It can be used to analyze and identify the causes of problems in queries. In the exam, it is possible to have a question that requires interpreting a given Query Profile and identifying the problem. Let’s check some common query performance problems that can be asked during the exam.

Some common problems that can be caught by the query profile:

Insufficient pruning – If in a Table Scan operator, statistics ‘Partitions scanned’ is significantly smaller than ‘Partitions total’, this can mean efficient pruning. Otherwise, it can indicate poor pruning.
“Exploding” joins – If a join operator is time-consuming and the output records of the join are significantly greater than the input records, this can indicate a mistake where the join condition is missing, causing a “Cartesian” join, in which records from one table match all records in the other table. This is also called an “exploding” join.
Too large queries – For large queries that consume too much memory, the server memory can be insufficient, causing data to spill to the local disk; if the local disk is not sufficient, data can further spill to the remote disk. This behavior can be detected using the Query Profile.
UNION instead of UNION ALL – The difference between UNION and UNION ALL is that UNION performs deduplication and therefore is a more costly operation compared to UNION ALL. If duplicates are not expected or deduplication is not needed, it is recommended to use UNION ALL. However, sometimes UNION is used unnecessarily, causing performance issues. This can be detected in the Query Profile by an extra Aggregate operator with the UNION ALL operator, which performs deduplication.

Some questions in the exam can be related to query activity monitoring. So, below are some resources for monitoring query activity:

The Query History page in Snowsight
The Grouped Query History page in Snowsight
The QUERY_HISTORY view in the ACCOUNT_USAGE schema
The AGGREGATE_QUERY_HISTORY view in the ACCOUNT_USAGE schema
The QUERY_HISTORY family of table functions in INFORMATION_SCHEMA

Kafka Connector

This Snowflake connector for Kafka, also known as the Kafka connector, is used to read data from Kafka topics and write the data into Snowflake tables.

Data Loading Methods:

Snowpipe
Snowpipe Streaming

Supported formats of passing messages:

JSON
AVRO

Workflow:

The Kafka connector subscribes to one or more Kafka topics, and the objects created for each topic are the following:
- One internal stage per topic
- One pipe for each topic partition
- One table for each topic (If the specified table does not exist, the connector creates it. If it exists, adds the corresponding columns-RECORD_CONTENT and RECORD_METADATA)

The table schema loaded by Kafka:

If created by the Kafka connector, it contains two VARIANT columns:
- RECORD_CONTENT – Contains the Kafka message
- RECORD_METADATA – Contains metadata about the message
If the table exists, RECORD_CONTENT and RECORD_METADATA are added, and checks if other columns are nullable (throws an error if they are not nullable)

Conclusion

To sum up, we highlighted some important concepts and details that have a high probability of appearing in the exam. This is an important, yet small, portion of the exam material. In the next articles, we will explore more exam-related concepts and briefly summarize the most important ones.

Next Steps

For additional information, please follow the links below:

Sergey Gigoyan

Sergey Gigoyan (LinkedIn) is a Senior Technical Architect specializing in data and databases with more than 15 years of experience. Sergey focuses on modern data architectures, database design and development, performance tuning and optimization, high availability solutions, BI development and DW design. He has worked with SQL Server, Oracle, and PostgreSQL databases, as well as cloud-based data solutions (AWS and Azure). Sergey also has extensive experience with modern data stacks such as Snowflake and dbt.

Sergey’s experience spans various industries. He had the privilege of working with IT giants such as Oracle as a Principal Data Engineer and BlackBerry as well as innovative startups. He helped deliver complex database solutions and advanced data strategies.

Sergey is also the author of “Building a Successful Career in IT – How I Did It” where he provides actionable advice on thriving in the ever-evolving IT industry.

MSSQLTips Awards: Champion (100+ tips) – 2024 | Author of the Year – 2020