Loading Data from S3 to Snowflake

For any data engineer working in the modern data stack, loading data from a data lake like Amazon S3 into a cloud data platform like Snowflake is a daily reality. While it seems straightforward, the loading method you choose can have a major impact on performance, cost, and data latency.

Simply getting the data in is not enough. A senior data engineer builds pipelines that are efficient, scalable, and cost-effective.

This guide moves beyond a simple COPY command and covers four essential best practices for building a high-performance data ingestion pipeline between S3 and Snowflake.

1. Choose the Right Tool: Snowpipe vs. COPY Command

The first and most critical decision is selecting the right ingestion method for your use case.

Use Snowpipe for Continuous, Event-Driven Loading

Snowpipe is Snowflake’s serverless, automated data ingestion service. It listens for new files in an S3 bucket (via S3 event notifications) and automatically loads them into your target table.

  • When to use it: For near real-time data pipelines where new files are arriving frequently and unpredictably. Think logs, IoT data, or event streams.
  • Why it’s a best practice: It’s serverless, so you don’t need to manage a virtual warehouse for ingestion. Billing is based on the compute used to load each file (plus a small per-file overhead), which is typically more cost-effective for small, frequent loads than keeping a warehouse running.

SQL

-- Example Snowpipe setup: AUTO_INGEST = TRUE tells the pipe to load files
-- as S3 event notifications arrive for the stage's bucket
CREATE PIPE my_s3_pipe
  AUTO_INGEST = TRUE
AS
COPY INTO my_raw_table  -- my_raw_table is assumed to have a single VARIANT column for the Parquet rows
FROM @my_s3_stage
FILE_FORMAT = (TYPE = 'PARQUET');
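
For AUTO_INGEST to work, the pipe above needs an existing external stage and an S3 event notification on the bucket that targets the pipe’s SQS queue. A minimal sketch, assuming a hypothetical landing prefix and a storage integration named my_s3_integration that already exists:

SQL

-- Hypothetical stage over the landing prefix (the storage integration is assumed to exist)
CREATE STAGE my_s3_stage
  URL = 's3://my-data-lake/events/'
  STORAGE_INTEGRATION = my_s3_integration
  FILE_FORMAT = (TYPE = 'PARQUET');

-- The notification_channel column returns the SQS ARN to use as the
-- target of the bucket's "object created" event notification
SHOW PIPES LIKE 'my_s3_pipe';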

Use the COPY Command for Batch Loading

The traditional COPY INTO command is designed for bulk or batch loading of data. It requires a user-specified virtual warehouse to execute.

  • When to use it: For large, scheduled batch jobs that load many files at once (e.g., a nightly ETL process).
  • Why it’s a best practice: For massive data volumes, running a COPY command on a dedicated warehouse is often more performant and cost-effective than Snowpipe, because a larger warehouse can load many files in parallel (see the sketch below).
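
A minimal sketch of a nightly batch load, assuming a hypothetical dedicated warehouse (batch_load_wh) and a daily export prefix on the stage:

SQL

-- Run the bulk load on a warehouse sized for the batch window
USE WAREHOUSE batch_load_wh;

COPY INTO my_raw_table
FROM @my_s3_stage/daily_export/
FILE_FORMAT = (TYPE = 'PARQUET')
ON_ERROR = 'ABORT_STATEMENT';  -- stop on the first bad file (the default for bulk loads)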

2. Optimize Your File Sizes

This is a simple but incredibly effective best practice. Snowflake’s ingestion performance is highly dependent on file size.

  • The Problem with Tiny Files: Loading thousands of tiny files (e.g., < 10 MB) creates significant overhead, as Snowflake incurs a small management cost for each file it processes.
  • The Problem with Giant Files: A single, massive file (e.g., > 5 GB) cannot be loaded in parallel, creating a bottleneck.
  • The Sweet Spot: Aim for file sizes between 100 MB and 250 MB (compressed). This allows Snowflake’s parallel processing to work most effectively.

Actionable Tip: If you have control over the source system, configure it to generate files in this optimal size range. If you are dealing with thousands of small files, consider adding a pre-processing step using AWS Glue or a Lambda function to compact them into larger files before loading.
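
If you are not sure what file sizes you are actually ingesting, Snowflake’s load history can tell you. A quick audit query, assuming access to the ACCOUNT_USAGE share and a hypothetical target table name:

SQL

-- File sizes (in MB) loaded into the target table over the last 7 days
SELECT file_name,
       file_size / (1024 * 1024) AS file_size_mb
FROM snowflake.account_usage.copy_history
WHERE table_name = 'MY_RAW_TABLE'
  AND last_load_time > DATEADD(day, -7, CURRENT_TIMESTAMP())
ORDER BY file_size_mb;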

3. Use an Optimal Folder Structure in S3

How you organize your files in S3 can dramatically improve query performance and simplify your data loading process. Use a logical, partitioned folder structure that includes the date and other key attributes.

A good folder structure: s3://your-bucket/source-name/table-name/YYYY/MM/DD/

Example: s3://my-data-lake/salesforce/orders/2025/09/28/orders_01.parquet

Why this is a best practice:

  • Simplified Loading: You can use the COPY command to load data from specific time ranges easily.
  • Partition Pruning: When you create external tables in Snowflake on top of this S3 data, Snowflake can automatically prune (ignore) folders that are not relevant to your query’s WHERE clause, drastically reducing scan time and cost (see the external table sketch after the COPY example below).

SQL

-- Load data for a specific day
COPY INTO my_orders_table
FROM @my_s3_stage/salesforce/orders/2025/09/28/
FILE_FORMAT = (TYPE = 'PARQUET');
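
To get partition pruning on top of this layout, you can define an external table whose partition column is parsed from the file path. A hedged sketch, assuming the stage points at the bucket root and using hypothetical table and column names (adjust the SPLIT_PART indexes to your own path depth):

SQL

-- Partition column derived from the .../YYYY/MM/DD/... segments of the file path
CREATE EXTERNAL TABLE my_orders_ext (
  order_date DATE AS TO_DATE(
    SPLIT_PART(metadata$filename, '/', 3) || '-' ||
    SPLIT_PART(metadata$filename, '/', 4) || '-' ||
    SPLIT_PART(metadata$filename, '/', 5), 'YYYY-MM-DD')
)
PARTITION BY (order_date)
LOCATION = @my_s3_stage/salesforce/orders/
AUTO_REFRESH = TRUE
FILE_FORMAT = (TYPE = 'PARQUET');

-- Queries that filter on order_date only scan the matching folders
SELECT COUNT(*) FROM my_orders_ext WHERE order_date = '2025-09-28';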

4. Always Load Pre-Processed, Columnar Data

Never load raw, uncompressed JSON or CSV files directly into your final Snowflake tables if you can avoid it. Pre-processing your data in the data lake leads to significant performance and cost savings.

  • Use Columnar Formats: Convert your raw data to a compressed, columnar format like Apache Parquet or ORC.
  • Benefits of Parquet:
    • Reduced Storage Costs: Parquet files are highly compressed, lowering your S3 storage bill.
    • Faster Loading: Snowflake is highly optimized for ingesting columnar formats.
    • Less Snowflake Compute: Because Parquet is columnar, Snowflake can read only the columns it needs during the load, which can be more efficient (see the transformation sketch below).
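
To illustrate the last point, a COPY command can apply a transformation that pulls only the fields you need out of each Parquet record. A minimal sketch with hypothetical column names:

SQL

-- Parquet rows arrive as a single VARIANT ($1); cast just the needed fields
COPY INTO my_orders_table (order_id, order_date, amount)
FROM (
  SELECT $1:order_id::STRING,
         $1:order_date::DATE,
         $1:amount::NUMBER(10, 2)
  FROM @my_s3_stage/salesforce/orders/
)
FILE_FORMAT = (TYPE = 'PARQUET');

Alternatively, adding MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE to a plain COPY loads Parquet into a structured table whose column names match the file’s schema.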

Actionable Tip: Use a tool like AWS Glue or a simple Lambda function to run a lightweight ETL job that converts incoming JSON or CSV files into Parquet before they are loaded by Snowpipe or the COPY command.

Conclusion

Loading data from S3 into Snowflake is a fundamental task, but optimizing it is what sets a great data engineer apart. By choosing the right tool for your workload (Snowpipe vs. COPY), optimizing your file sizes, using a logical folder structure, and leveraging efficient file formats like Parquet, you can build a data ingestion pipeline that is not only fast and reliable but also highly cost-effective.
