Why Choose Snowflake?
I'm leading a Business Intelligence Competence Center at my company. Our team is responsible for the data warehouse the data models and reporting on top of those models. Our current data warehouse is in dire need of an upgrade. It is a traditional data warehouse that still runs on on-premises servers on an old version of SQL Server. I'm hearing a lot of good stuff about a new data warehouse vendor, Snowflake. I'm wondering if we should migrate workloads to Snowflake in the cloud or not.
Snowflake is a data warehouse vendor and its database is a cloud data warehouse offering which is available on Microsoft Azure, but also on Amazon Web Services (AWS) and the Google Cloud platform. Snowflake has been gaining a lot of traction over the past years and has conquered market share from Microsoft Azure, Google, AWS and Oracle. Snowflake has had the biggest software IPO in history. The cloud is becoming more popular and there are many options for data platforms out there. In this tip, we'll cover some reasons why it might be a good idea to try out Snowflake from a "not too technical" perspective.
For an introduction to Snowflake and some of its features, check out the Snowflake tutorial.
It's in the Cloud
Snowflake being a cloud database vendor has several benefits. As it is available on multiple clouds (Amazon AWS, Azure and Google), it's sort of "neutral". You can choose the cloud vendor that suits you best, the Snowflake experience remains the same. Billing is done through Snowflake, not the cloud vendor.
Like most "as-a-service" offering in the cloud, the complexities of the set-up is hidden for you. You don't have to deal with choosing a server and configuring hardware. You only need to register at the Snowflake website, wait for about 15 minutes and your admin credentials will be e-mailed to you. You log in into the Snowflake portal and you can start building your data warehouse!
Backups, disaster recovery, high availability; everything is taken care off for you. All you have to worry about is loading data into Snowflake and write SQL. Snowflake has a very ANSI compatible dialect of SQL, so if you can write SQL for SQL Server, you can write SQL for Snowflake as well.
Scalability and Elasticity
Like most cloud services, the Snowflake database can easily scale in real-time to your demands. There are also options available for auto-growth, so the database becomes truly elastic. When you're working with an on-premises server, you have to calculate up-front how big your server needs to be. Typically, the server is scaled to manage the highest peak of demand (known at that time). But what if this peak only occurs periodically? For example, at the start of each month when the previous month is being closed by accounting? Or at the end of a quarter when sales tries desperately to get their sales bonus? Then the server sits idly for most of the time. You've paid a lot of money for that hardware (and possible for the software licenses as well), but you're not getting your best value out of it.
Migrating to the cloud means you can take advantage of its scalable and elastic nature. When demand rises, you simply scale up your warehouse so it can process all queries. If demand falls, you scale it back down. With Snowflake, you pay a fixed amount of money per terabyte of storage and you also pay for each minute the "compute" is running. In Snowflake, the compute power is called "warehouse" and it comes in a variety of sizes:
Suppose you're running an X-Small warehouse most of the time. If no queries come in, the warehouse will be suspended and you don't pay anything at all (except storage of course). Once a query comes in, the warehouse resumes running and you start paying for each second that it is running. When you need more compute power, you can simply switch to a bigger warehouse, for example Medium. This can even be automated with SQL. Medium is 4 times bigger than an X-Small warehouse and we can assume it has about 4 times the compute power. But, it also costs 4 times more (exact pricing depends on your licensing, but we can calculate using "credits"). For one hour of compute, you pay 1 credit for an X-Small warehouse, but 4 credits for a X-Small warehouse. However, because the Medium warehouse is faster, it's also possible your queries finish quicker. This means you don't need to keep your warehouse running as long as with the X-Small. It might be possible you pay the same amount of money, but your queries are quicker!
One of Snowflake best features is that scaling your warehouses up or down (or add additional warehouses into a cluster) is transparent. There's no downtime at all! Queries keep on running, connections are not dropped and once the warehouse has finished scaling (typically in a second or less), new queries can take advantage. Snowflake has completely separated compute and storage, and you can see the results in their impressive scaling.
On-premises, you need to buy your server and your licenses first, which is Capex spending. With Snowflake (and most cloud offerings), you get a monthly bill which is Opex spending. You can potentially save a lot of money by migrating to Snowflake. Unless you run compute all the time, than it can become really expensive.
Snowflake has impressive performance, even with an X-Small warehouse. You can easily run queries over millions of records, gigabytes of data and still have your result returned in seconds or less. The following query (which has a correlated subquery which is never ideal for performance) takes about 12 seconds on a X-Small warehouse:
SELECT POSTID ,(SELECT count(1) FROM DBO.POSTHISTORY ph WHERE p.POSTID = ph.POSTID) AS cnt FROM DBO.POSTS p
The posts table has almost 42 million rows and 21GB in size, while the post history table contains 109 million rows and is 37GB in size. Both tables come from the StackOverflow sample database.
The query returns 42 million rows (they're of course not all shown in the browser). One of the reasons Snowflake is this fast is because all data is stored in a columnar format, comparable with columnstore indexes in SQL Server. Data is divided into partitions. When scanning data, Snowflake can also eliminate partitions (if you use a WHERE clause for example) to minimize IO overhead.
However, Snowflake doesn't have many tuning options. There are no indexes you can put on a table (well, you can define a primary key and a foreign key, but they're not enforced. They serve as "documentation" and hints for client tools). You can experiment with the clustering key to try optimize performance and concurrency for a given table, but that's about it. If a SQL query is slow, you basically have two options:
- Scale up to a faster warehouse. If your problem is more IO related (i.e. data processing), for example there's an expensive sort and you're spilling to disk, this might not always be a solution.
- Rewrite the SQL query.
Not Only a Data Warehouse
The Snowflake architecture offers many features beyond just a "relational database optimized for data warehousing". Snowflake positions itself as a "cloud data platform". You have the data warehouse, but you can also build your data lake in Snowflake as well. There's built-in support for parsing semi-structured data like Json and XML, and you can also read data from various file formats such as Json, Avro, Parquet and Orc. There's support for streaming, using their own Snowpipe or Kafka, and Snowflake offers integration with Python and Spark.
Snowflake is not only there for the business intelligence professional, but also for the data engineers and the data scientists. It allows you to build one single data platform that can serve multiple purposes. You can find more examples on the Snowflake website.
- What do you think? Are there more advantages or disadvantages we should know about? Let us know in the comments!
- Check out other Snowflake resources:
About the author
View all my tips
Article Last Updated: 2022-02-16