Snowflake: The Good, The Bad and The Ugly



Snowflake, or SnowflakeDB, is a cloud SaaS database for analytical workloads and batch data ingestion, typically used for building a data warehouse in the cloud. However, it appears to be so cool and shiny that people cannot stop praising it all around the internet. Seeing that, I could not resist the urge to take a closer look at this technology and poke into some of its pain points. What also struck me at first was the lack of SnowflakeDB criticism in blogs and on message boards, which sounds suspicious given the self-proclaimed customer base of more than 1000 enterprises. So, let’s take a closer look at it.

Diagram 1. Snowflake architecture, based on their SIGMOD paper.

I will not write a full review of the Snowflake design here, as it would take a lot of pages, and the Snowflake team has already done it pretty well. Instead, I will focus on the main principles of their design and its differentiation from the other DWH solutions available on the market:

  • Amazon S3 and Azure Storage are used as the storage backend. This approach allows SnowflakeDB to piggyback on the cloud providers both in terms of capacity provisioning and IO throughput guarantees for their storage. This is pretty convenient: cloud providers tend to invest a lot of money in making their storage bullet-proof, as this is one of the main value propositions of a cloud platform, and it would be foolish not to use it to your advantage when developing a new cloud-native solution.
  • Storage is decoupled from compute. Traditional MPP solutions use the local HDDs of cluster nodes to store the data, effectively making a single cluster machine the non-divisible unit of scaling. Also, due to the MPP design, all the machines in a cluster must have the same hardware specs to avoid resource skew between cluster nodes, so the coupling of compute resources with storage is pretty tight. I have seen a number of cases where a vendor put tremendous effort into decoupling storage from compute: take Exadata, or the multiple storage vendors trying to put their enterprise storage appliances under more traditional MPP solutions like Vertica and Greenplum. It has never really achieved any success, for a simple reason: to build distributed storage elastic enough to handle an MPP workload, you need to put a tremendous amount of time and money into its development, deployment and support. This is what all the cloud providers did to introduce their cloud storage offerings, and this is what no enterprise can afford to repeat in an on-premise deployment. So, Snowflake has decoupled storage from compute, introducing the concept of stateless worker pools (Virtual Warehouses) that talk to the cloud storage to read and write the data.
  • Stateless compute nodes. Query execution is handled by processes running in a Virtual Warehouse. A Virtual Warehouse is essentially a set of stateless worker nodes that you can provision and scale on demand. You also have the option to run multiple Virtual Warehouses over the same data stored in your cloud storage, which is a big advantage. Saying they are stateless is not completely true, as each of the nodes in a Virtual Warehouse uses local disks for caching the data read from cloud storage. Block affinity to the Virtual Warehouse nodes is used to take advantage of this caching; however, “stealing” is possible, so the affinity is not strict.
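To illustrate how block affinity with “stealing” might work, here is a toy Python sketch; the hashing scheme, node names and load threshold are my own assumptions, not Snowflake internals. Each block has a stable “home” worker whose local cache it warms, but a saturated home node lets the least-loaded worker steal the block:

```python
import hashlib

def home_node(block_id: str, nodes: list[str]) -> str:
    """Pick a stable 'home' node for a block so its cache stays warm."""
    digest = int(hashlib.md5(block_id.encode()).hexdigest(), 16)
    return nodes[digest % len(nodes)]

def assign(blocks: list[str], nodes: list[str],
           load: dict[str, int], max_load: int) -> dict[str, str]:
    """Prefer the home node; 'steal' to the least-loaded node if the home is busy."""
    assignment = {}
    for block in blocks:
        node = home_node(block, nodes)
        if load[node] >= max_load:                    # home node saturated
            node = min(nodes, key=lambda n: load[n])  # steal to least-loaded
        load[node] += 1
        assignment[block] = node
    return assignment

nodes = ["worker-1", "worker-2", "worker-3"]
load = {n: 0 for n in nodes}
plan = assign([f"block-{i}" for i in range(9)], nodes, load, max_load=4)
```

A stable placement function like this keeps cache hit rates high on repeated scans, while the stealing path stops a hot node from becoming a straggler.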
Diagram 2. Snowflake micro-partitions, illustration from the official documentation.
  • Micro-partitions. The data is stored in the cloud storage in reasonably sized blocks: 16MB based on the SIGMOD paper, 50MB to 500MB of uncompressed data based on the official documentation. Such a block is called a micro-partition. Each block represents a set of rows from the table and is stored in a compressed columnar format. Each micro-partition has a set of statistics for each of the fields: min-max values, number of distinct values, and a bloom filter of the fields that exist in semi-structured data. For semi-structured data, some of the fields are pulled out of the blob (based on heuristics) and have the same field-level statistics collected on them. These fields are also subject to heuristic type deduction, which allows extra statistics to be collected (for example, converting a JSON field to an integer and collecting min-max statistics for that integer). Statistics are collected for each block independently on block creation, which is handled by the Snowflake engine transparently to the users.
  • Centralised metadata management. SnowflakeDB has a component called Cloud Services. Effectively, Snowflake stores all the metadata for its customers in this layer, in a secret-sauce key-value store. Metadata is stored in a centralised manner (except for Snowflake VPS customers), which means block-level statistics and other metadata are stored in a single key-value store for a large set of customers. Storing metadata this way allows Snowflake to do interesting tricks like cloning tables or whole databases just by cloning their metadata. The metadata keeps referencing the same micro-partitions, and after the metadata fork the two objects can be changed independently of each other.
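The metadata-only cloning trick can be sketched in a few lines of Python; the catalog layout here is a simplification of my own, not Snowflake’s actual metadata schema:

```python
from copy import deepcopy

# Table metadata is just the list of immutable micro-partition files it owns.
catalog = {
    "sales": {"partitions": ["p1", "p2", "p3"]},
}

def clone_table(catalog: dict, source: str, target: str) -> None:
    """Clone = copy metadata only; both tables reference the same files."""
    catalog[target] = deepcopy(catalog[source])

clone_table(catalog, "sales", "sales_dev")
# After the fork the two objects diverge independently: appending a new
# partition to the clone does not touch the source table's metadata.
catalog["sales_dev"]["partitions"].append("p4")
```

Because micro-partitions are immutable, copying the list of references is safe: neither table can change a file the other one sees, so no data is duplicated at clone time.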
Diagram 3. Snowflake online upgrades, based on their SIGMOD paper.
  • Online upgrades. Software upgrades are transparent to you and are handled centrally by the Snowflake team. This is made possible by having stateless compute and centralised cloud storage: Snowflake does a dual run of the old and new software versions, letting your client applications transition from the old to the new deployment transparently.
  • MVCC on the block level. Data blocks in Snowflake are immutable, i.e. micro-partitions are immutable. Each update of the data stored in a micro-partition effectively causes a new micro-partition with the updated data to be created, and metadata about this event to be written to the centralised metadata storage. This way, MVCC is implemented on the block level. Imagine you read table T at timestamp X. The query engine asks the metadata store to return all the blocks that existed in table T at timestamp X, and then reads these blocks. If at a later moment Y an update of the table data has happened, the metadata is updated to reflect this change, so that a query at timestamp Z will read a different set of blocks than the query at timestamp X. Old blocks can be stored for 90+ days, if you have enough money to afford it, as you will be billed for the cloud storage used by the old blocks. Expired blocks are automatically cleaned up by auxiliary Cloud Services processes, so you don’t have to care about them.
  • No partitions. No indexes. No tunables. This is one of the main mottos of Snowflake: it is a solution that is ready to use out of the box, and you don’t need to spend time on optimising and tuning it (according to their marketing).
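The block-level MVCC described above can be condensed into a toy Python model; this is my own simplification (real Snowflake keeps this state in the Cloud Services metadata layer), but it shows how immutable blocks plus versioned metadata give you time travel for free:

```python
import itertools

class Table:
    """Toy block-level MVCC: each commit records the full set of live blocks."""
    def __init__(self):
        self._clock = itertools.count(1)
        self.versions = {0: frozenset()}   # timestamp -> visible block ids
        self.latest = 0

    def commit(self, added=(), removed=()):
        """An update never mutates a block: it swaps old block ids for new ones."""
        ts = next(self._clock)
        blocks = (self.versions[self.latest] - frozenset(removed)) | frozenset(added)
        self.versions[ts] = blocks
        self.latest = ts
        return ts

    def read_at(self, ts):
        """Time travel: return the set of blocks visible at timestamp ts."""
        visible = max(v for v in self.versions if v <= ts)
        return self.versions[visible]

t = Table()
x = t.commit(added=["b1", "b2"])             # initial bulk load
z = t.commit(added=["b2'"], removed=["b2"])  # an update rewrites b2 as b2'
```

A reader pinned to timestamp X keeps seeing b1 and b2, while a later reader sees b1 and b2'; garbage collection is simply dropping versions older than the retention window.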

Of course, this solution has numerous advantages. I won’t describe them here; you can just read the official marketing materials, as no one will speak about their advantages better than their own marketers. I will focus instead on the shortcomings of Snowflake, starting with the least technical ones:

  • You don’t own the deployment infrastructure. This means you fully depend on Snowflake support to react fast in case of an emergency. You can’t perform any data recovery or mitigation procedures on your side unless you have a separate on-demand deployment of a different DWH solution, which would obviously be too expensive to maintain.
  • You don’t own your data. Metadata (including encryption keys) is decoupled from the data and is stored by Snowflake separately. This means all the S3 or Azure Storage blobs you have are completely meaningless if the Cloud Services layer is affected by an outage.
  • Shared Cloud Services. The Cloud Services layer is shared across multiple customers (except VPS ones), which means any security incident related to Cloud Services might expose the data of multiple customers at the same time. Another problem is that an outage of the Cloud Services layer has a huge blast radius: it will affect multiple customers, and getting support during such an outage will be very problematic.
  • Rapid customer base growth. Surprisingly, this is a problem, and I’ll explain why. Fast growth of the customer base means a rapidly increasing load on the support and engineering departments of the company. For a complex enough software solution like Snowflake, it usually takes at least 6 months for a newly hired engineer to reach productivity, and in those first 6 months the new hire will consume loads of their experienced peers’ time. So there are natural limits on how fast the engineering and support teams can grow. Decreasing support quality and the introduction of first-line support have been direct consequences of this, and have been reported as drawbacks by multiple customers.
  • Proprietary code. This means there is a clear cap on the level of expertise for all engineers outside of the Snowflake company. This angle can also be used by Snowflake to sell professional services engagements, but it is clearly a pain for the customers, especially for those used to open source solutions like Apache Hadoop and Apache Spark.
  • Small ecosystem. Snowflake is still a young technology compared to its competitors, so a lot of functionality is still missing: no geospatial support, limited UI functionality, immature ETL tool integrations.
  • Only cloud. AWS and Azure are the only deployment options. I have nothing to add to this.

And some more technical issues directly related to their design:

  • No partitions. Yes, you can’t partition your fact table; it will be a single table with all N years of events. To work around this, you can sort the table data by a field like the transaction date. This allows Snowflake to use the min-max statistics of micro-partitions to prune the ones that do not contain relevant dates for queries that filter on date. This feature is called clustered tables. Snowflake will gladly maintain the clustering for you transparently, but of course for the fee of the compute and storage resources required to achieve this.
  • No indexes. So you don’t have an index, but still want an efficient query that selects a single customer from a data mart? And you have already used clustering on the date? Well, here is a materialized view for you: just define a copy of the base table clustered by a different key, and there you get your fast customer-retrieving query. Of course, this comes at the cost of the compute and storage resources required to maintain a full copy of the data and to shadow each DML operation on the base table.
  • Only bulk data load. Remember that modifying even a single field in a single row causes the whole block of data to be copied? This effectively means the only optimal operational mode for Snowflake is bulk inserts, and the greater the size of your batch, the more optimal the data processing and the final data layout will be. As an example, having an operational data store with a Customer table and a CDC process that delivers updates to 0.1% of the Customer table entries to your DWH every 5 minutes is not practical. Each update batch would effectively require a full table scan, and in the worst case (if the IDs of the changed customers are distributed evenly and happen to hit every micro-partition) the Snowflake engine will rewrite a copy of each micro-partition.
  • Limited UDFs. Being a SaaS solution, Snowflake puts a big emphasis on security. This means only interpreted languages are possible in UDFs, hence the currently supported SQL and JavaScript. You can say goodbye to your Python scripts, as Python does not have a native sandbox solution. Also, Snowflake is not mature enough yet to introduce something like PL/SQL or at least PL/pgSQL.
  • No data constraints. No constraints are enforced except NOT NULL. No CHECK constraints for you to ensure data quality at load time.
  • Limited workload management. If you want higher priority for some of your traffic, isolate it in a separate virtual warehouse. It works, but not for all use cases, and greater separation incurs greater costs. Fun anecdote: Snowflake allows queries to spill to S3 and thus effectively never die due to OOM, so your data analysts with their queries containing cartesian products will be happy to utilize terabytes of cloud storage resources. (Update: automatic scaling is available for Snowflake Enterprise Edition and above, which helps to reduce the cost of running multiple virtual warehouses by adjusting their size automatically based on the observed load.)
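To see why clustering matters so much for micro-partition pruning (the “clustered tables” point above), here is a toy Python experiment of my own; the block size and data distribution are arbitrary assumptions, not Snowflake defaults. On unsorted data nearly every block’s min-max range covers the searched value, while on sorted data almost all blocks are pruned:

```python
def build_partitions(rows, rows_per_block):
    """Split rows into fixed-size blocks and record per-block min/max stats."""
    blocks = []
    for i in range(0, len(rows), rows_per_block):
        chunk = rows[i:i + rows_per_block]
        blocks.append({"rows": chunk, "min": min(chunk), "max": max(chunk)})
    return blocks

def scan_eq(blocks, value):
    """Keep only blocks whose min/max range could contain the value."""
    return [b for b in blocks if b["min"] <= value <= b["max"]]

# Synthetic "transaction dates" arriving in a scrambled order.
dates = [20230101 + (i * 37) % 365 for i in range(1000)]
unclustered = build_partitions(dates, 100)
clustered = build_partitions(sorted(dates), 100)

hit_u = len(scan_eq(unclustered, 20230150))  # blocks scanned, unsorted layout
hit_c = len(scan_eq(clustered, 20230150))    # blocks scanned, clustered layout
```

In this run the unclustered layout has to scan all 10 blocks, while the clustered one touches only one or two, which is exactly the effect you pay Snowflake’s clustering maintenance fee for.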

As a summary, I’d say Snowflake is a really promising technology, and it really is a cloud-native DWH solution. I like all the engineering work the company is doing, and I look forward to more news on their successful customer base expansion. However, I remain skeptical whenever a technology pretends to be a silver bullet.


Disclaimer: everything in this article represents my personal and humble opinion, and is not affiliated with any of my employers.

This entry was posted in DBMS, Enterprises and tagged mpp, MVCC, rdbms, snowflake on by 0x0FFF.

