Rapidly loading JSON data into Snowflake

At a recent job, I was trying to backfill years of JSON data, stored in a Postgres database column, into a flattened, denormalized structure in Snowflake for easy access. I wrote what I felt was a good, robust loading process, using Pandas and SQLAlchemy to handle the translation to a flat structure and the load into Snowflake. It recognized new columns and dynamically altered the Snowflake table to add them with the appropriate inferred types. However, what worked in the early days of the startup didn't scale through our rapid growth, and I learned several lessons:

  • Pandas' `json_normalize` is an extremely poor performer on any considerable amount of data. I ended up writing my own flattener and saw what I'd call an infinite improvement in speed: at one point `json_normalize` simply could not process a single day's worth of our data on a server with 384 GB of RAM and 96 cores. (A sketch of such a flattener follows this list.)
  • Similarly, keeping everything in a Pandas DataFrame became extremely inefficient as the rows and columns grew. I refactored the process to hold all of the data in plain Python structures: a list of dicts for the rows, and a dict to track and infer the data type of each column. A DataFrame was created only at the end, empty, from that type information, so Pandas could still do the heavy lifting of turning it into a `create table` statement. This change alone tripled performance. (See the type-tracking sketch after this list.)
  • Loading the JSON data into Snowflake became increasingly slow. Rather than throw a larger node at the problem, I adopted two strategies that together increased throughput four-fold:
    • Only write JSON fields that actually contain data. This became possible once I moved away from a DataFrame for managing the in-memory data and generating the JSON. The change cut the Snowflake load time in half, but it required writing a small custom writer around the `json` package to produce newline-delimited JSON, a format that, surprisingly, the maintainers of the `json` module have been reluctant to support natively. (The NDJSON writer sketch below shows the idea.)
    • Use much larger batches of records in each file staged to Snowflake. This wasn't an intuitive result, but going from 50,000 records per upload to 20,000,000 cut the load time in half again. It did require much more disk space for the transient data file before the PUT to Snowflake, but storage is a good trade-off for performance. (A batching sketch follows the list as well.)
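
Since I can't share the real implementation, here's a minimal sketch of the kind of hand-rolled flattener I mean. It assumes each record is a (possibly nested) dict, joins nested keys with dots, and leaves lists as-is rather than exploding them into rows; all names are illustrative.

```python
def flatten_record(record, parent_key="", sep="."):
    """Flatten one nested dict into a single-level dict with dotted keys."""
    flat = {}
    for key, value in record.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten_record(value, new_key, sep))
        else:
            # Lists and scalars are stored as-is; no row explosion.
            flat[new_key] = value
    return flat


def flatten_records(records):
    """Flatten an iterable of nested dicts into a list of flat dicts."""
    return [flatten_record(r) for r in records]
```

Because nothing here builds intermediate DataFrames or copies columns around, the cost stays roughly linear in the total number of keys.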
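
The type-tracking side worked conceptually like the sketch below: a plain dict maps each column name to an inferred dtype, and an empty DataFrame is built only at the very end so Pandas can emit the DDL. Using `pd.io.sql.get_schema` for that last step is an assumption on my part for illustration, not necessarily what the production code did.

```python
import pandas as pd


def infer_column_types(rows):
    """Track an inferred pandas dtype per column across all flattened rows."""
    col_types = {}
    for row in rows:
        for col, value in row.items():
            col_types.setdefault(col, None)
            if value is None:
                continue  # nulls tell us nothing about the type
            if isinstance(value, bool):  # check bool before int: bool subclasses int
                observed = "bool"
            elif isinstance(value, int):
                observed = "int64"
            elif isinstance(value, float):
                observed = "float64"
            else:
                observed = "object"
            previous = col_types[col]
            if previous is None or previous == observed:
                col_types[col] = observed
            elif {previous, observed} == {"int64", "float64"}:
                col_types[col] = "float64"   # widen ints to floats
            else:
                col_types[col] = "object"    # mixed types fall back to object
    return {col: dtype or "object" for col, dtype in col_types.items()}


def create_table_ddl(table_name, col_types):
    """Build an empty, typed DataFrame and let pandas generate the CREATE TABLE."""
    empty = pd.DataFrame({c: pd.Series(dtype=t) for c, t in col_types.items()})
    return pd.io.sql.get_schema(empty, table_name)
```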
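
For the newline-delimited output, the idea is just to serialize one object per line and drop null keys before serializing. A sketch, with an assumed `default` handler for types the standard encoder rejects:

```python
import json
from datetime import date, datetime
from decimal import Decimal


def _json_default(value):
    """Fallback for types json.dumps can't serialize natively (illustrative)."""
    if isinstance(value, (datetime, date)):
        return value.isoformat()
    if isinstance(value, Decimal):
        return str(value)
    raise TypeError(f"Object of type {type(value).__name__} is not JSON serializable")


def write_ndjson(rows, path):
    """Write one JSON object per line, omitting keys whose value is None."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            sparse = {k: v for k, v in row.items() if v is not None}
            f.write(json.dumps(sparse, default=_json_default))
            f.write("\n")
```

Snowflake's JSON file format reads this one-object-per-line layout directly, and since absent keys simply don't appear, the staged files end up much smaller than fully dense rows.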
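
Finally, the batching change comes down to how many rows go into each file before the PUT. Below is a sketch using `snowflake-connector-python` and the `write_ndjson` helper from the previous sketch; the stage name, table name, and COPY options are assumptions, not our production setup.

```python
import snowflake.connector

BATCH_SIZE = 20_000_000  # records per staged file; we started at 50,000


def write_batch_files(rows, prefix, batch_size=BATCH_SIZE):
    """Split the row stream into a few very large NDJSON files."""
    paths, batch = [], []
    for row in rows:
        batch.append(row)
        if len(batch) >= batch_size:
            path = f"{prefix}_{len(paths):04d}.json"
            write_ndjson(batch, path)  # from the sketch above
            paths.append(path)
            batch = []
    if batch:
        path = f"{prefix}_{len(paths):04d}.json"
        write_ndjson(batch, path)
        paths.append(path)
    return paths


def stage_and_copy(conn, paths, table="events_flat", stage="@json_stage"):
    """PUT each batch file onto an internal stage, then COPY into the table."""
    cur = conn.cursor()
    try:
        for path in paths:
            # PUT compresses the file client-side (gzip) by default.
            cur.execute(f"PUT file://{path} {stage}")
        cur.execute(
            f"COPY INTO {table} FROM {stage} "
            "FILE_FORMAT = (TYPE = 'JSON') "
            "MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE"
        )
    finally:
        cur.close()


# Usage (connection parameters omitted):
# conn = snowflake.connector.connect(account="...", user="...", password="...")
# stage_and_copy(conn, write_batch_files(flattened_rows, "/tmp/backfill"))
```

The trade-off is exactly the one described above: a 20-million-record file takes real local disk, but it means far fewer PUT/COPY round trips.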

I wish I could share the real code, but the company views it as proprietary; the sketches above are only generic illustrations of the approach. Hopefully these tips will prove useful to someone.