Blog

    Loading Data from S3 to Snowflake

     For any data engineer working in the modern data stack, loading data from a data lake like Amazon S3 into a cloud data platform like Snowflake is a daily reality. While it seems straightforward, the method you choose to load data from S3 to Snowflake can have a massive impact on performance, cost, and data latency.

    Simply getting the data in is not enough. A senior data engineer builds pipelines that are efficient, scalable, and cost-effective.

    This guide moves beyond a simple COPY command and covers four essential best practices for building a high-performance data ingestion pipeline between S3 and Snowflake.

    1. Choose the Right Tool: Snowpipe vs. COPY Command

    The first and most critical decision is selecting the right ingestion method for your use case.

    Use Snowpipe for Continuous, Event-Driven Loading

    Snowpipe is Snowflake’s serverless, automated data ingestion service. It listens for new files in an S3 bucket (via S3 event notifications) and automatically loads them into your target table.

    • When to use it: For near real-time data pipelines where new files are arriving frequently and unpredictably. Think logs, IoT data, or event streams.
    • Why it’s a best practice: It’s serverless, meaning you don’t need to manage a virtual warehouse for ingestion. Costs are calculated per-file, which is highly efficient for small, frequent loads.

    SQL

    -- Example Snowpipe setup
    CREATE PIPE my_s3_pipe
      AUTO_INGEST = TRUE
    AS
    COPY INTO my_raw_table
    FROM @my_s3_stage
    FILE_FORMAT = (TYPE = 'PARQUET');
    

    Use the COPY Command for Batch Loading

    The traditional COPY INTO command is designed for bulk or batch loading of data. It requires a user-specified virtual warehouse to execute.

    • When to use it: For large, scheduled batch jobs where you are loading a large number of files at once (e.g., a nightly ETL process).
    • Why it’s a best practice: For massive data volumes, using a dedicated warehouse with a COPY command is often more performant and cost-effective than Snowpipe, as you can leverage the power of a larger warehouse to load files in parallel.
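
    For example, a nightly batch job might run a bulk load like this on a dedicated warehouse (the stage, table, and warehouse names here are placeholders):

    SQL

    -- Nightly batch load on a dedicated warehouse
    USE WAREHOUSE loading_wh;

    COPY INTO my_raw_table
    FROM @my_s3_stage/daily_extract/
    PATTERN = '.*[.]parquet'
    FILE_FORMAT = (TYPE = 'PARQUET');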

    2. Optimize Your File Sizes

    This is a simple but incredibly effective best practice. Snowflake’s ingestion performance is highly dependent on file size.

    • The Problem with Tiny Files: Loading thousands of tiny files (e.g., < 10 MB) creates significant overhead, as Snowflake incurs a small management cost for each file it processes.
    • The Problem with Giant Files: A single, massive file (e.g., > 5 GB) cannot be loaded in parallel, creating a bottleneck.
    • The Sweet Spot: Aim for file sizes between 100 MB and 250 MB (compressed). This allows Snowflake’s parallel processing to work most effectively.

    Actionable Tip: If you have control over the source system, configure it to generate files in this optimal size range. If you are dealing with thousands of small files, consider adding a pre-processing step using AWS Glue or a Lambda function to compact them into larger files before loading.

    3. Use an Optimal Folder Structure in S3

    How you organize your files in S3 can dramatically improve query performance and simplify your data loading process. Use a logical, partitioned folder structure that includes the date and other key attributes.

    A good folder structure: s3://your-bucket/source-name/table-name/YYYY/MM/DD/

    Example: s3://my-data-lake/salesforce/orders/2025/09/28/orders_01.parquet

    Why this is a best practice:

    • Simplified Loading: You can use the COPY command to load data from specific time ranges easily.
    • Partition Pruning: When you create external tables in Snowflake on top of this S3 data, Snowflake can automatically prune (ignore) folders that are not relevant to your query’s WHERE clause, drastically reducing scan time and cost.

    SQL

    -- Load data for a specific day
    COPY INTO my_orders_table
    FROM @my_s3_stage/salesforce/orders/2025/09/28/
    FILE_FORMAT = (TYPE = 'PARQUET');
    

    4. Always Load Pre-Processed, Columnar Data

    Never load raw, uncompressed JSON or CSV files directly into your final Snowflake tables if you can avoid it. Pre-processing your data in the data lake leads to significant performance and cost savings.

    • Use Columnar Formats: Convert your raw data to a compressed, columnar format like Apache Parquet or ORC.
    • Benefits of Parquet:
      • Reduced Storage Costs: Parquet files are highly compressed, lowering your S3 storage bill.
      • Faster Loading: Snowflake is highly optimized for ingesting columnar formats.
      • Less Snowflake Compute: Because Parquet is columnar, Snowflake can read only the columns it needs during the load, which can be more efficient.

    Actionable Tip: Use a tool like AWS Glue or a simple Lambda function to run a lightweight ETL job that converts incoming JSON or CSV files into Parquet before they are loaded by Snowpipe or the COPY command.

    Conclusion

    Loading data from S3 into Snowflake is a fundamental task, but optimizing it is what sets a great data engineer apart. By choosing the right tool for your workload (Snowpipe vs. COPY), optimizing your file sizes, using a logical folder structure, and leveraging efficient file formats like Parquet, you can build a data ingestion pipeline that is not only fast and reliable but also highly cost-effective.

    AWS Data Pipeline Cost Optimization Strategies

     Building a powerful data pipeline on AWS is one thing. Building one that doesn’t burn a hole in your company’s budget is another. As data volumes grow, the costs associated with storage, compute, and data transfer can quickly spiral out of control. For an experienced data engineer, mastering AWS data pipeline cost optimization is not just a valuable skill—it’s a necessity.

    Optimizing your AWS bill isn’t about shutting down services; it’s about making intelligent, architectural choices. It’s about using the right tool for the job, understanding data lifecycle policies, and leveraging the full power of serverless and spot instances.

    This guide will walk you through five practical, high-impact strategies to significantly reduce the cost of your AWS data pipelines.

    1. Implement an S3 Intelligent-Tiering and Lifecycle Policy

    Your data lake on Amazon S3 is the foundation of your pipeline, but storing everything in the “S3 Standard” class indefinitely is a costly mistake.

    • S3 Intelligent-Tiering: This storage class is a game-changer for cost optimization. It automatically moves your data between two access tiers—a frequent access tier and an infrequent access tier—based on your access patterns, without any performance impact or operational overhead. This is perfect for data lakes where you might have hot data that’s frequently queried and cold data that’s rarely touched.
    • S3 Lifecycle Policies: For data that has a predictable lifecycle, you can set up explicit rules. For example, you can automatically transition data from S3 Standard to “S3 Glacier Instant Retrieval” after 90 days for long-term archiving at a much lower cost. You can also set policies to automatically delete old, unnecessary files.

    Actionable Tip: Enable S3 Intelligent-Tiering on your main data lake buckets. For logs or temporary data, create a lifecycle policy to automatically delete files older than 30 days.
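
    Here’s a minimal sketch of the lifecycle half of that tip using boto3 (the bucket name and prefixes are placeholders; adjust the rules to your own retention needs):

    Python

    import boto3

    s3 = boto3.client("s3")

    # Expire temporary log files after 30 days and archive processed data
    # to Glacier Instant Retrieval after 90 days.
    s3.put_bucket_lifecycle_configuration(
        Bucket="your-data-lake-bucket",
        LifecycleConfiguration={
            "Rules": [
                {
                    "ID": "expire-temp-logs",
                    "Filter": {"Prefix": "logs/"},
                    "Status": "Enabled",
                    "Expiration": {"Days": 30},
                },
                {
                    "ID": "archive-processed-data",
                    "Filter": {"Prefix": "processed/"},
                    "Status": "Enabled",
                    "Transitions": [{"Days": 90, "StorageClass": "GLACIER_IR"}],
                },
            ]
        },
    )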

    2. Go Serverless with AWS Glue and Lambda

    If you are still managing your own EC2-based Spark or Airflow clusters, you are likely overspending. Serverless services like AWS Glue and AWS Lambda ensure you only pay for the compute you actually use, down to the second.

    • AWS Glue: Instead of running a persistent cluster, use Glue for your ETL jobs. A Glue job provisions the necessary resources when it starts and terminates them the second it finishes. There is zero cost for idle time.
    • AWS Lambda: For small, event-driven tasks—like triggering a job when a file lands in S3—Lambda is incredibly cost-effective. You get one million free requests per month, and the cost per invocation is minuscule.

    Actionable Tip: Refactor your cron-based ETL scripts running on an EC2 instance into an event-driven pipeline using an S3 trigger to start a Lambda function, which in turn starts an AWS Glue job.

    3. Use Spot Instances for Batch Workloads

    For non-critical, fault-tolerant batch processing jobs, EC2 Spot Instances can save you up to 90% on your compute costs compared to On-Demand prices. Spot Instances are spare EC2 capacity that AWS offers at a steep discount.

    When to Use:

    • Large, overnight ETL jobs.
    • Model training in SageMaker.
    • Any batch workload that can be stopped and restarted without major issues.

    Actionable Tip: When configuring your AWS Glue jobs, right-size the “Worker type” and number of workers, and consider the Flex execution class for non-urgent jobs, which runs them on spare capacity at a reduced rate (Glue does not expose Spot Instances directly). For services like Amazon EMR or Kubernetes on EC2, you can configure your worker or task nodes to use Spot Instances.

    4. Choose the Right File Format (Hello, Parquet!)

    The way you store your data has a massive impact on both storage and query costs. Storing your data in a raw format like JSON or CSV is inefficient.

    Apache Parquet is a columnar storage file format that is optimized for analytics.

    • Smaller Storage Footprint: Parquet’s compression is highly efficient, often reducing file sizes by 75% or more compared to CSV. This directly lowers your S3 storage costs.
    • Faster, Cheaper Queries: Because Parquet is columnar, query engines like Amazon Athena, Redshift Spectrum, and AWS Glue can read only the columns they need for a query, instead of scanning the entire file. This drastically reduces the amount of data scanned, which is how Athena and Redshift Spectrum charge you.

    Actionable Tip: Add a step in your ETL pipeline to convert your raw data from JSON or CSV into Parquet before storing it in your “processed” S3 bucket.

    Python

    # A simple AWS Glue script snippet to convert CSV to Parquet
    import sys
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.job import Job
    
    # ... (Glue context setup)
    
    # Read raw CSV data
    source_dyf = glueContext.create_dynamic_frame.from_options(
        connection_type="s3",
        connection_options={"paths": ["s3://your-raw-bucket/data/"]},
        format="csv",
        format_options={"withHeader": True},
    )
    
    # Write data in Parquet format
    glueContext.write_dynamic_frame.from_options(
        frame=source_dyf,
        connection_type="s3",
        connection_options={"path": "s3://your-processed-bucket/data/"},
        format="parquet",
    )
    
    

    5. Monitor and Alert with AWS Budgets

    You can’t optimize what you can’t measure. AWS Budgets is a simple but powerful tool that allows you to set custom cost and usage budgets and receive alerts when they are exceeded.

    • Set a Monthly Budget: Create a budget for your total monthly AWS spend.
    • Use Cost Allocation Tags: Tag your resources (e.g., S3 buckets, Glue jobs, EC2 instances) by project or team. You can then create budgets that are specific to those tags.
    • Create Alerts: Set up alerts to notify you via email or SNS when your costs are forecasted to exceed your budget.

    Actionable Tip: Go to the AWS Budgets console and create a monthly cost budget for your data engineering projects. Set an alert to be notified when you reach 80% of your budgeted amount. This gives you time to investigate and act before costs get out of hand.
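
    If you prefer to script it, here’s a rough sketch of the same budget and alert via boto3 (the account ID, amount, and email address are placeholders):

    Python

    import boto3

    budgets = boto3.client("budgets")

    # Monthly cost budget with a forecasted-spend alert at 80%
    budgets.create_budget(
        AccountId="123456789012",
        Budget={
            "BudgetName": "data-engineering-monthly",
            "BudgetLimit": {"Amount": "500", "Unit": "USD"},
            "TimeUnit": "MONTHLY",
            "BudgetType": "COST",
        },
        NotificationsWithSubscribers=[
            {
                "Notification": {
                    "NotificationType": "FORECASTED",
                    "ComparisonOperator": "GREATER_THAN",
                    "Threshold": 80.0,
                    "ThresholdType": "PERCENTAGE",
                },
                "Subscribers": [
                    {"SubscriptionType": "EMAIL", "Address": "data-team@example.com"}
                ],
            }
        ],
    )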

    Conclusion

    AWS data pipeline cost optimization is an ongoing process, not a one-time fix. By implementing smart storage strategies with S3, leveraging serverless compute with Glue and Lambda, using Spot Instances for batch jobs, optimizing your file formats, and actively monitoring your spending, you can build a highly efficient and cost-effective data platform that scales with your business.

    Snowflake Performance Tuning Techniques

     Snowflake is incredibly fast out of the box, but as your data and query complexity grow, even the most powerful engine needs a tune-up. Slow-running queries not only frustrate users but also lead to higher credit consumption and wasted costs. The good news is that most performance issues can be solved with a few key techniques.

    If you’re an experienced data engineer, mastering Snowflake performance tuning is a critical skill that separates you from the crowd. It’s about understanding how Snowflake works under the hood and making strategic decisions to optimize your workloads.

    This guide will walk you through five actionable techniques to diagnose and fix slow-running queries in Snowflake.

    Before You Tune: Use the Query Profile

    The first rule of optimization is: don’t guess, measure. Snowflake’s Query Profile is the single most important tool for diagnosing performance issues. Before applying any of these techniques, you should always analyze the query profile of a slow query to identify the bottlenecks. It will show you exactly which operators are taking the most time, how much data is being scanned, and if you’re spilling data to disk.

    1. Right-Size Your Virtual Warehouse

    One of the most common misconceptions is that a bigger warehouse is always better. The key is to choose the right size for your workload.

    • Scale Up for Complexity: Increase the warehouse size (e.g., from Small to Medium) when you need to improve the performance of a single, complex query. Larger warehouses have more memory and local SSD caching, which is crucial for large sorts, joins, and aggregations.
    • Scale Out for Concurrency: Use a multi-cluster warehouse when you need to handle a high number of simultaneous, simpler queries. This is ideal for BI dashboards where many users are running queries at the same time. Scaling out adds more warehouses of the same size, distributing the user load without making any single query faster.

    Actionable Tip: If a single ETL job is slow, try running it on the next warehouse size up and measure the performance gain. If your BI dashboard is slow during peak hours, configure your warehouse as a multi-cluster warehouse with an auto-scaling policy.
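
    Both adjustments are simple ALTER WAREHOUSE statements; a quick sketch with placeholder warehouse names:

    SQL

    -- Scale UP: give a single heavy job more memory and compute
    ALTER WAREHOUSE etl_wh SET WAREHOUSE_SIZE = 'MEDIUM';

    -- Scale OUT: let a BI warehouse add clusters under concurrent load
    -- (multi-cluster warehouses require Enterprise Edition or higher)
    ALTER WAREHOUSE bi_wh SET
      MIN_CLUSTER_COUNT = 1,
      MAX_CLUSTER_COUNT = 4,
      SCALING_POLICY = 'STANDARD';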

    2. Master Your Clustering Keys

    This is arguably the most impactful technique for tuning large tables. Snowflake automatically stores data in micro-partitions. A clustering key co-locates data with similar values in the same micro-partitions, which allows Snowflake to prune (ignore) the partitions that aren’t needed for a query.

    When to Use:

    • On very large tables (hundreds of gigabytes or terabytes).
    • When your queries frequently filter or join on a high-cardinality column (e.g., user_id or event_timestamp).

    Actionable Tip: Analyze your slow queries in the Query Profile. If you see a “TableScan” operator that is scanning a huge number of partitions but only returning a few rows, it’s a strong indicator that you need a clustering key.

    SQL

    -- Define a clustering key when creating a table
    CREATE TABLE my_large_table (
      event_timestamp TIMESTAMP_NTZ,
      user_id VARCHAR,
      payload VARIANT
    ) CLUSTER BY (user_id, event_timestamp);
    
    -- Check the clustering health of a table
    SELECT SYSTEM$CLUSTERING_INFORMATION('my_large_table');
    

    3. Avoid Spilling to Remote Storage

    “Spilling” happens when an operation runs out of memory and has to write intermediate data to storage. Spilling to local SSD is fast, but spilling to remote cloud storage is a major performance killer.

    How to Detect It:

    • In the Query Profile, look for a “Bytes spilled to remote storage” warning on operators like Sort or Join.
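
    You can also hunt for spilling across your whole account; a sketch against the standard ACCOUNT_USAGE views:

    SQL

    -- Queries that spilled to remote storage in the last 7 days
    SELECT
      query_id,
      warehouse_name,
      bytes_spilled_to_local_storage,
      bytes_spilled_to_remote_storage,
      total_elapsed_time / 1000 AS elapsed_seconds
    FROM snowflake.account_usage.query_history
    WHERE start_time >= DATEADD('day', -7, CURRENT_TIMESTAMP())
      AND bytes_spilled_to_remote_storage > 0
    ORDER BY bytes_spilled_to_remote_storage DESC
    LIMIT 20;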

    How to Fix It:

    1. Increase Warehouse Size: The simplest solution is to run the query on a larger warehouse with more available memory.
    2. Optimize the Query: Try to reduce the amount of data being processed. Filter data as early as possible in your query, and select only the columns you need.

    4. Use Materialized Views for High-Frequency Queries

    If you have a complex query that is run very frequently on data that doesn’t change often, a Materialized View can provide a massive performance boost.

    A materialized view pre-computes the result of a query and stores it, almost like a cached result set. When you query the materialized view, you’re just querying the stored results, which is incredibly fast. Snowflake automatically keeps the materialized view up-to-date in the background as the base table data changes.

    When to Use:

    • On a query that aggregates or joins data from a large, slowly changing table.
    • When the query is run hundreds or thousands of times a day (e.g., powering a critical dashboard).

    SQL

    CREATE MATERIALIZED VIEW mv_daily_sales_summary AS
    SELECT
      sale_date,
      category,
      SUM(amount) as total_sales
    FROM
      raw_sales
    GROUP BY
      1, 2;
    

    5. Optimize Your Joins

    Poorly optimized joins are a common cause of slow queries.

    • Join Order: Join your largest tables last. Start by joining your smaller dimension tables together first, and then join them to your large fact table. This reduces the size of the intermediate result sets.
    • Filter Early: Apply WHERE clauses to your tables before you join them, especially on the large fact table. This reduces the number of rows that need to be processed in the join.

    SQL

    -- GOOD: Filter before joining
    SELECT
      u.user_name,
      SUM(s.amount)
    FROM
      (SELECT * FROM sales WHERE sale_date > '2025-01-01') s -- Filter first
    JOIN
      users u ON s.user_id = u.user_id
    GROUP BY 1;
    
    -- BAD: Join everything then filter
    SELECT
      u.user_name,
      SUM(s.amount)
    FROM
      sales s
    JOIN
      users u ON s.user_id = u.user_id
    WHERE
      s.sale_date > '2025-01-01' -- Filter last
    GROUP BY 1;
    

    Conclusion

    Snowflake performance tuning is a blend of science and art. By using the Query Profile to diagnose bottlenecks and applying these five techniques—warehouse management, clustering, avoiding spilling, using materialized views, and optimizing joins—you can significantly improve the speed of your queries, reduce costs, and build a highly efficient data platform.

    Advanced Snowflake Interview Questions for Experienced

     Stop memorizing the difference between a VARCHAR and a TEXT field. If you’re an experienced data engineer, you know that real Snowflake interviews go much deeper. Hiring managers aren’t just looking for someone who knows the syntax; they’re looking for an architect who understands performance, cost optimization, and scalable design patterns.

    Yet, most online resources are flooded with basic, entry-level questions that don’t prepare you for a senior-level discussion.

    This guide is different. We’ve compiled a list of advanced, scenario-based Snowflake interview questions for experienced engineers that reflect the real-world challenges you’ll be expected to solve. Let’s dive in.

    1. Architecture & Design Questions

    These questions test your high-level understanding of Snowflake’s architecture and your ability to design robust solutions.

    Q1: “We have a new data source that will be queried by both our BI team (frequent, small queries) and our data science team (infrequent, massive queries). How would you design the compute layer to handle this efficiently without one team impacting the other?”

    • Why they’re asking: This is a core test of your understanding of multi-cluster virtual warehouses. They want to see if you can design for concurrency and cost-effectiveness.
    • What a strong answer looks like:
      • “I would implement a multi-cluster warehouse strategy. For the BI team, I’d set up a dedicated warehouse, let’s call it BI_WH, in multi-cluster mode with an auto-scaling policy. This allows it to scale out horizontally to handle high concurrency during peak hours and scale back down to save costs.”
      • “For the data science team, I would create a separate, more powerful warehouse, say DS_WH. This could be a larger size (e.g., Large or X-Large) that is initially suspended. The data scientists can resume it when they need to run their heavy queries and suspend it immediately after, ensuring they have the power they need without incurring idle costs.”
      • “This completely isolates the workloads, ensuring the BI team’s dashboards remain fast and responsive, regardless of what the data science team is doing.”

    Q2: “Describe a scenario where you would choose a larger warehouse size (e.g., X-Large) versus scaling out a multi-cluster warehouse.”

    • Why they’re asking: To test your understanding of scaling up vs. scaling out.
    • What a strong answer looks like:
      • “You scale up (increase warehouse size) when you need to improve the performance of a single, complex query. For example, a massive data transformation job with complex joins and aggregations on terabytes of data would benefit from the increased memory and compute of a larger warehouse.”
      • “You scale out (add clusters to a multi-cluster warehouse) when you need to handle high concurrency—many users running simple, fast queries at the same time. A customer-facing dashboard with hundreds of simultaneous users is a perfect use case for scaling out.”

    2. Performance Tuning & Cost Optimization Questions

    For an experienced engineer, managing costs is just as important as managing performance.

    Q3: “A dashboard is running slower than expected. The query profile shows significant ‘table scan’ time. What are your first steps to diagnose and solve this?”

    • Why they’re asking: This is a classic performance tuning question. They want to see your troubleshooting methodology.
    • What a strong answer looks like:
      • “My first step would be to analyze the query profile in detail. A large table scan suggests that Snowflake is reading more data than necessary.”
      • “I’d immediately investigate the clustering key on the table. If the query frequently filters or joins on a specific column (e.g., event_timestamp or customer_id), but that column isn’t the clustering key, the table might have poor ‘clustering depth’. I would check SYSTEM$CLUSTERING_INFORMATION.”
      • “If the clustering is poor, I would consider defining a new clustering key on the most frequently filtered high-cardinality columns. For very large tables, I would also check if the query could be rewritten to take advantage of query pruning, for example, by adding a filter on a date partition column.”

    Q4: “Your Snowflake costs have unexpectedly increased by 30% this month. How would you investigate the root cause?”

    • Why they’re asking: This is a critical question about cost management and governance.
    • What a strong answer looks like:
      • “I would start by querying the snowflake.account_usage schema, which is the source of truth for all credit consumption.”
      • “Specifically, I would use the WAREHOUSE_METERING_HISTORY view to identify which virtual warehouses are responsible for the increased credit usage. I’d aggregate by day and warehouse to pinpoint the spike.”
      • “Once I’ve identified the warehouse, I’d query the QUERY_HISTORY view, filtering by the problematic warehouse and time period. I’d look for long-running queries, queries with high bytes spilled to local or remote storage, or an unusual increase in the number of queries.”
      • “Finally, I would implement resource monitors to prevent this in the future. I’d set up monitors to suspend warehouses or send notifications when they reach, for example, 80% of their monthly credit quota.”
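
    A sketch of what that investigation and guardrail could look like in SQL (the warehouse name and credit quota are illustrative):

    SQL

    -- 1. Which warehouses drove the spike?
    SELECT
      warehouse_name,
      DATE_TRUNC('day', start_time) AS usage_day,
      SUM(credits_used) AS credits
    FROM snowflake.account_usage.warehouse_metering_history
    WHERE start_time >= DATEADD('day', -30, CURRENT_TIMESTAMP())
    GROUP BY 1, 2
    ORDER BY credits DESC;

    -- 2. Add a guardrail for next month
    CREATE RESOURCE MONITOR monthly_credit_monitor
      WITH CREDIT_QUOTA = 1000
      FREQUENCY = MONTHLY
      START_TIMESTAMP = IMMEDIATELY
      TRIGGERS
        ON 80 PERCENT DO NOTIFY
        ON 100 PERCENT DO SUSPEND;

    ALTER WAREHOUSE analytics_wh SET RESOURCE_MONITOR = monthly_credit_monitor;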

    3. Data Ingestion & Integration Questions

    These questions test your practical knowledge of getting data into Snowflake.

    Q5: “Explain the differences between Snowpipe, Snowflake Tasks, and external tools like Fivetran/Airbyte for data ingestion. When would you choose one over the others?”

    • Why they’re asking: To assess your knowledge of the modern data stack and your ability to choose the right tool for the job.
    • What a strong answer looks like:
      • “Snowpipe is best for continuous, event-driven micro-batching. You’d use it for near real-time ingestion from sources like S3, where files are being dropped frequently and unpredictably. It’s serverless and highly efficient for this pattern.”
      • “Snowflake Tasks are for scheduled, batch-oriented workflows that run entirely within Snowflake. You’d use Tasks to orchestrate a series of SQL statements, like running an ELT job every hour to transform raw data that’s already landed in Snowflake.”
      • “External tools like Fivetran or Airbyte are best for connector-based ingestion from third-party sources like Salesforce, Google Analytics, or a PostgreSQL database. They handle the complexity of API changes and schema replication, saving significant development time. You wouldn’t build a custom Salesforce connector if a reliable, pre-built one exists.”

    4. Scenario-Based & Problem-Solving Questions

    These are designed to see how you think on your feet.

    Q6: “You need to provide your marketing team with read-only access to a 50TB production table for a one-off analysis. The table is constantly being updated. How do you do this with minimal cost and without impacting the production environment?”

    • Why they’re asking: This tests your knowledge of Zero-Copy Cloning.
    • What a strong answer looks like:
      • “This is a perfect use case for Zero-Copy Cloning. I would create an instantaneous clone of the production table using the CREATE TABLE ... CLONE command. This operation doesn’t duplicate the 50TB of storage; it only copies the metadata, making it instant and virtually free from a storage perspective.”
      • “I would then grant the marketing team’s role SELECT privileges on this cloned table. They can run their heavy analytical queries on the clone using their own virtual warehouse, completely isolating their workload from our production systems. Once their analysis is complete, the cloned table can be dropped.”
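
    In SQL, the whole hand-off is only a few statements (the database, schema, and role names below are placeholders):

    SQL

    -- Instant, metadata-only copy of the 50TB production table
    CREATE TABLE analytics.marketing_sandbox.orders_clone
      CLONE prod_db.sales.orders;

    -- Read-only access for the marketing team
    GRANT USAGE ON DATABASE analytics TO ROLE marketing_role;
    GRANT USAGE ON SCHEMA analytics.marketing_sandbox TO ROLE marketing_role;
    GRANT SELECT ON TABLE analytics.marketing_sandbox.orders_clone TO ROLE marketing_role;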

    Conclusion

    These questions are just a starting point, but they represent the type of thinking required for a senior Snowflake Data Engineer role. It’s not just about knowing the features, but about knowing how to apply them to solve real-world problems of scale, performance, and cost. Good luck!

    Automated ETL with Airflow and Python: A Practical Guide

    In the world of data, consistency is king. Manually running scripts to fetch and process data is not just tedious; it’s prone to errors, delays, and gaps in your analytics. To build a reliable data-driven culture, you need automation. This is where building an automated ETL with Airflow and Python becomes a data engineer’s most valuable skill.

    Apache Airflow is the industry-standard open-source platform for orchestrating complex data workflows. When combined with the power and flexibility of Python for data manipulation, you can create robust, scheduled, and maintainable pipelines that feed your analytics platforms with fresh data, day in and day out.

    This guide will walk you through a practical example: building an Airflow DAG that automatically fetches cryptocurrency data from a public API, processes it with Python, and prepares it for analysis.

    The Architecture: A Simple, Powerful Workflow

    Our automated pipeline will consist of a few key components, orchestrated entirely by Airflow. The goal is to create a DAG (Directed Acyclic Graph) that defines the sequence of tasks required to get data from our source to its destination.

    Here’s the high-level architecture of our ETL pipeline:

    Public API: Our data source. We’ll use the free CoinGecko API to fetch the latest cryptocurrency prices.

    Python Script: The core of our transformation logic. We’ll use the requests library to call the API and pandas to process the JSON response into a clean, tabular format.

    Apache Airflow: The orchestrator. We will define a DAG that runs on a schedule (e.g., daily), executes our Python script, and handles logging, retries, and alerting.

    Data Warehouse/Lake: The destination. The processed data will be saved as a CSV, which in a real-world scenario would be loaded into a data warehouse like Snowflake, BigQuery, or a data lake like Amazon S3.

    Let’s get into the code.

    Step 1: The Python ETL Script

    First, we need a Python script that handles the logic of fetching and processing the data. This script will be called by our Airflow DAG. We’ll use a PythonVirtualenvOperator in Airflow, which means our script can have its own dependencies.

    Create a file named get_crypto_prices.py in your Airflow project’s /include directory.

    /include/get_crypto_prices.py

    Python

    import requests
    import pandas as pd
    from datetime import datetime
    
    def fetch_and_process_crypto_data():
        """
        Fetches cryptocurrency data from the CoinGecko API and processes it.
        """
        print("Fetching data from CoinGecko API...")
        url = "https://api.coingecko.com/api/v3/simple/price"
        params = {
            'ids': 'bitcoin,ethereum,ripple,cardano,solana',
            'vs_currencies': 'usd',
            'include_market_cap': 'true',
            'include_24hr_vol': 'true',
            'include_24hr_change': 'true'
        }
        
        try:
            response = requests.get(url, params=params)
            response.raise_for_status()  # Raise an exception for bad status codes
            data = response.json()
            print("Data fetched successfully.")
    
            # Process the JSON data into a list of dictionaries
            processed_data = []
            for coin, details in data.items():
                processed_data.append({
                    'coin': coin,
                    'price_usd': details.get('usd'),
                    'market_cap_usd': details.get('usd_market_cap'),
                    'volume_24h_usd': details.get('usd_24h_vol'),
                    'change_24h_percent': details.get('usd_24h_change'),
                    'timestamp': datetime.now().isoformat()
                })
    
            # Create a pandas DataFrame
            df = pd.DataFrame(processed_data)
            
            # In a real pipeline, you'd load this to a database.
            # For this example, we'll save it to a CSV in the local filesystem.
            output_path = '/tmp/crypto_prices.csv'
            df.to_csv(output_path, index=False)
            print(f"Data processed and saved to {output_path}")
            
        except requests.exceptions.RequestException as e:
            print(f"Error fetching data from API: {e}")
            raise
    
    if __name__ == "__main__":
        fetch_and_process_crypto_data()
    

    Step 2: Creating the Airflow DAG

    Now, let’s create the Airflow DAG that will schedule and run this script. This file will live in your Airflow dags/ folder.

    We’ll use the PythonVirtualenvOperator to create a clean, isolated task with its own Python dependencies.

    dags/crypto_etl_dag.py

    Python

    from __future__ import annotations
    
    import pendulum
    
    from airflow.models.dag import DAG
    from airflow.operators.python import PythonVirtualenvOperator
    
    with DAG(
        dag_id="crypto_price_etl_pipeline",
        start_date=pendulum.datetime(2025, 9, 27, tz="UTC"),
        schedule="0 8 * * *",  # Run daily at 8:00 AM UTC
        catchup=False,
        tags=["api", "python", "etl"],
        doc_md="""
        ## Cryptocurrency Price ETL Pipeline
        This DAG fetches the latest crypto prices from the CoinGecko API,
        processes the data with Python, and saves it as a CSV.
        """,
    ) as dag:
        
        def _run_etl():
            # Imports happen inside the callable because PythonVirtualenvOperator
            # serializes this function and runs it in a separate virtual
            # environment (make sure the include/ directory is importable there).
            from include.get_crypto_prices import fetch_and_process_crypto_data

            fetch_and_process_crypto_data()

        run_etl_task = PythonVirtualenvOperator(
            task_id="run_python_etl_script",
            python_callable=_run_etl,
            requirements=["pandas==2.1.0", "requests==2.31.0"],
            system_site_packages=False,
        )
    
    

    This DAG is simple but powerful. Airflow will now:

    • Run this pipeline automatically every day at 8:00 AM UTC.
    • Create a temporary virtual environment and install pandas and requests for the task.
    • Execute our Python function to fetch and process the data.
    • Log the entire process, and alert you if anything fails.

    Step 3: The Analytics Payoff

    With our pipeline running automatically, we now have a consistently updated CSV file (/tmp/crypto_prices.csv on the Airflow worker). In a real-world scenario where this data is loaded into a SQL data warehouse, an analyst can now run queries to derive insights, knowing the data is always fresh.

    An analyst could now answer questions like:

    • What is the daily trend of Bitcoin’s market cap?
    • Which coin had the highest percentage change in the last 24 hours?
    • How does trading volume correlate with price changes across different coins?
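
    For instance, assuming the data is loaded into a warehouse table named crypto_prices with the same columns our script produces, the first question is a short query:

    SQL

    -- Daily trend of Bitcoin's market cap
    SELECT
      CAST(timestamp AS DATE) AS snapshot_date,
      AVG(market_cap_usd) AS avg_market_cap_usd
    FROM crypto_prices
    WHERE coin = 'bitcoin'
    GROUP BY 1
    ORDER BY 1;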

    Conclusion: Build Once, Benefit Forever

    By investing a little time to build an automated ETL with Airflow and Python, you create a resilient and reliable data asset. This approach eliminates manual, error-prone work and provides your analytics team with the fresh, trustworthy data they need to make critical business decisions. This is the core of modern data engineering: building automated systems that deliver consistent value.

    SQL Window Functions: The Ultimate Guide for Data Analysts

     Every data professional knows the power of GROUP BY. It’s the trusty tool we all learn first, allowing us to aggregate data and calculate metrics like total sales per category or the number of users per city. But what happens when the questions get more complex?

    • What are the top 3 best-selling products within each category?
    • How does this month’s revenue compare to last month’s for each department?
    • What is the running total of sales day-by-day?

    Trying to answer these questions with GROUP BY alone can lead to complex, inefficient, and often unreadable queries. This is where SQL window functions come in. They are the superpower you need to perform complex analysis while keeping your queries clean and performant.

    What Are Window Functions, Really?

    A window function performs a calculation across a set of table rows that are somehow related to the current row. Unlike a GROUP BY which collapses rows into a single output row, a window function returns a value for every single row.

    Think of it like this: a GROUP BY looks at the whole room and gives you one summary. A window function gives each person in the room a piece of information based on looking at a specific “window” of people around them (e.g., “the 3 tallest people in your group”).

    The magic happens with the OVER() clause, which defines the “window” of rows the function should consider.

    The Core Syntax

    The basic syntax for a window function looks like this:

    SQL

    SELECT
      column_a,
      column_b,
      AGGREGATE_FUNCTION() OVER (PARTITION BY ... ORDER BY ...) AS new_column
    FROM your_table;
    
    • AGGREGATE_FUNCTION(): Can be an aggregate function like SUM(), AVG(), or COUNT(), or a specialized window function like RANK().
    • OVER(): This is the mandatory clause that tells SQL you’re using a window function.
    • PARTITION BY column_name: This is like a GROUP BY within the window. It divides the rows into partitions (groups), and the function is calculated independently for each partition.
    • ORDER BY column_name: This sorts the rows within each partition. This is essential for functions that depend on order, like RANK() or running totals.

    Practical Examples: From Theory to Insight

    Let’s use a sample sales table to see window functions in action.

    order_id   sale_date    category     product    amount
    101        2025-09-01   Electronics  Laptop     1200
    102        2025-09-01   Books        SQL Guide  45
    103        2025-09-02   Electronics  Mouse      25
    104        2025-09-02   Electronics  Keyboard   75
    105        2025-09-03   Books        Data Viz   55

    1. Calculating a Running Total

    Goal: Find the cumulative sales total for each day.

    SQL

    SELECT
      sale_date,
      amount,
      SUM(amount) OVER (ORDER BY sale_date) AS running_total_sales
    FROM sales;
    

    Result:

    sale_date    amount   running_total_sales
    2025-09-01   1200     1200
    2025-09-01   45       1245
    2025-09-02   25       1270
    2025-09-02   75       1345
    2025-09-03   55       1400

    2. Ranking Rows within a Group (RANK, DENSE_RANK, ROW_NUMBER)

    Goal: Rank products by sales amount within each category.

    This is where PARTITION BY becomes essential.

    SQL

    SELECT
      category,
      product,
      amount,
      RANK() OVER (PARTITION BY category ORDER BY amount DESC) AS rank_num,
      DENSE_RANK() OVER (PARTITION BY category ORDER BY amount DESC) AS dense_rank_num,
      ROW_NUMBER() OVER (PARTITION BY category ORDER BY amount DESC) AS row_num
    FROM sales;
    
    • RANK(): Gives the same rank for ties, but skips the next rank. (1, 2, 2, 4)
    • DENSE_RANK(): Gives the same rank for ties, but does not skip. (1, 2, 2, 3)
    • ROW_NUMBER(): Assigns a unique number to every row, regardless of ties. (1, 2, 3, 4)
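
    Tying this back to the opening questions, one portable way to get the top 3 best-selling products within each category is to rank in a subquery and filter in the outer query:

    SQL

    SELECT category, product, amount
    FROM (
      SELECT
        category,
        product,
        amount,
        ROW_NUMBER() OVER (PARTITION BY category ORDER BY amount DESC) AS row_num
      FROM sales
    ) ranked
    WHERE row_num <= 3;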

    3. Comparing to Previous/Next Rows (LAG and LEAD)

    Goal: Find the sales amount from the previous day for each category.

    LAG() looks “behind” in the partition, while LEAD() looks “ahead”.

    SQL

    SELECT
      sale_date,
      category,
      amount,
      LAG(amount, 1, 0) OVER (PARTITION BY category ORDER BY sale_date) AS previous_day_sales
    FROM sales;
    

    The 1 means look back one row, and the 0 is the default value if no previous row exists.

    Result:

    sale_date    category     amount   previous_day_sales
    2025-09-01   Books        45       0
    2025-09-03   Books        55       45
    2025-09-01   Electronics  1200     0
    2025-09-02   Electronics  25       1200
    2025-09-02   Electronics  75       25

    Conclusion: Go Beyond GROUP BY

    While GROUP BY is essential for aggregation, SQL window functions are the key to unlocking a deeper level of analytical insights. They allow you to perform calculations on a specific subset of rows without losing the detail of the individual rows.

    By mastering functions like RANK(), SUM() OVER (...), LAG(), and LEAD(), you can write cleaner, more efficient queries and solve complex business problems that would be a nightmare to tackle with traditional aggregation alone.

    How to Build a Data Lakehouse on Azure

     For years, data teams have faced a difficult choice: the structured, high-performance world of the data warehouse, or the flexible, low-cost scalability of the data lake. But what if you could have the best of both worlds? Enter the Data Lakehouse, an architectural pattern that combines the reliability and performance of a warehouse with the openness and flexibility of a data lake. And when it comes to implementation, building a data lakehouse on Azure has become the go-to strategy for future-focused data teams.

    The traditional data lake, while great for storing vast amounts of raw data, often turned into a “data swamp”—unreliable and difficult to manage. The data warehouse, on the other hand, struggled with unstructured data and could become rigid and expensive. The Lakehouse architecture solves this dilemma.

    In this guide, we’ll walk you through the blueprint for building a powerful and modern data lakehouse on Azure, leveraging a trio of best-in-class services: Azure Data Lake Storage (ADLS) Gen2, Azure Databricks, and Power BI.

    The Azure Lakehouse Architecture: A Powerful Trio

    A successful Lakehouse implementation relies on a few core services working in perfect harmony. This architecture is designed to handle everything from raw data ingestion and large-scale ETL to interactive analytics and machine learning.

    Here’s the high-level architecture we will build:

    1. Azure Data Lake Storage (ADLS) Gen2: This is the foundation. ADLS Gen2 is a highly scalable and cost-effective cloud storage solution that combines the best of a file system with massive scale, making it the perfect storage layer for our Lakehouse.

    2. Azure Databricks: This is the unified analytics engine. Databricks provides a collaborative environment for data engineers and data scientists to run large-scale data processing (ETL/ELT) with Spark, build machine learning models, and manage the entire data lifecycle.

    3. Delta Lake: The transactional storage layer. Built on top of ADLS, Delta Lake is an open-source technology (natively integrated into Databricks) that brings ACID transactions, data reliability, and high performance to your data lake, effectively turning it into a Lakehouse.

    4. Power BI: The visualization and reporting layer. Power BI integrates seamlessly with Azure Databricks, allowing business users to run interactive queries and build insightful dashboards directly on the data in the Lakehouse.

    Let’s explore each component.

    Step 1: The Foundation – Azure Data Lake Storage (ADLS) Gen2

    Every great data platform starts with a solid storage foundation. For a Lakehouse on Azure, ADLS Gen2 is the undisputed choice. Unlike standard object storage, it includes a hierarchical namespace, which allows you to organize your data into directories and folders just like a traditional file system. This is critical for performance and organization in large-scale analytics.

    A best practice is to structure your data lake using a multi-layered approach, often called “medallion architecture”:

    • Bronze Layer (/bronze): Raw, untouched data ingested from various source systems.

    • Silver Layer (/silver): Cleaned, filtered, and standardized data. This is where data quality rules are applied.

    • Gold Layer (/gold): Highly aggregated, business-ready data that is optimized for analytics and reporting.

    Step 2: The Engine – Azure Databricks

    With our storage in place, we need a powerful engine to process the data. Azure Databricks is a first-class service on Azure that provides a managed, high-performance Apache Spark environment.

    Data engineers use Databricks notebooks to:

    • Ingest raw data from the Bronze layer.

    • Perform large-scale transformations, cleaning, and enrichment using Spark.

    • Write the processed data to the Silver and Gold layers.

    Here’s a simple PySpark code snippet you might run in a Databricks notebook to process raw CSV files into a cleaned-up table:

    Python

    # Databricks notebook code snippet
    from pyspark.sql.functions import col, to_date

    # Define paths for our data layers
    bronze_path = "/mnt/datalake/bronze/raw_orders.csv"
    silver_path = "/mnt/datalake/silver/cleaned_orders"

    # Read raw data from the Bronze layer using Spark
    df_bronze = (
        spark.read.format("csv")
        .option("header", "true")
        .option("inferSchema", "true")
        .load(bronze_path)
    )

    # Perform basic transformations
    df_silver = (
        df_bronze.select(
            col("OrderID").alias("order_id"),
            col("CustomerID").alias("customer_id"),
            to_date(col("OrderDate"), "MM/dd/yyyy").alias("order_date"),
            col("Amount").cast("decimal(18, 2)").alias("order_amount"),
        )
        .where(col("Amount").isNotNull())
    )

    # Write the cleaned data to the Silver layer as a Delta table
    df_silver.write.format("delta").mode("overwrite").save(silver_path)

    print("Successfully processed raw orders into the Silver layer.")

    Step 3: The Magic – Delta Lake

    Notice the .format(“delta”) in the code above? That’s the secret sauce. Delta Lake is an open-source storage layer that runs on top of your existing data lake (ADLS) and brings warehouse-like capabilities.

    Key features Delta Lake provides:

    • ACID Transactions: Ensures that your data operations either complete fully or not at all, preventing data corruption.

    • Time Travel (Data Versioning): Allows you to query previous versions of your data, making it easy to audit changes or roll back errors.

    • Schema Enforcement & Evolution: Prevents bad data from corrupting your tables by enforcing a schema, while still allowing you to gracefully evolve it over time.

    • Performance Optimization: Features like data skipping and Z-ordering dramatically speed up queries.

    By writing our data in the Delta format, we’ve transformed our simple cloud storage into a reliable, high-performance Lakehouse.
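
    For example, time travel is just an extra read option in the same notebook (a sketch; the version number depends on your table’s history):

    Python

    # Read the Silver table as it looked at an earlier version (Delta time travel)
    df_previous = (
        spark.read.format("delta")
        .option("versionAsOf", 0)
        .load("/mnt/datalake/silver/cleaned_orders")
    )

    # Compare it to the current version of the table
    df_current = spark.read.format("delta").load("/mnt/datalake/silver/cleaned_orders")
    print(df_current.count(), df_previous.count())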

    Step 4: The Payoff – Visualization with Power BI

    With our data cleaned and stored in the Gold layer of our Lakehouse, the final step is to make it accessible to business users. Power BI has a native, high-performance connector for Azure Databricks.

    You can connect Power BI directly to your Databricks cluster and query the Gold tables. This allows you to:

    • Build interactive dashboards and reports.

    • Leverage Power BI’s powerful analytics and visualization capabilities.

    • Ensure that everyone in the organization is making decisions based on the same, single source of truth from the Lakehouse.

    Conclusion: The Best of Both Worlds on Azure

    By combining the low-cost, scalable storage of Azure Data Lake Storage Gen2 with the powerful processing engine of Azure Databricks and the reliability of Delta Lake, you can build a truly modern data lakehouse on Azure. This architecture eliminates the need to choose between a data lake and a data warehouse, giving you the flexibility, performance, and reliability needed to support all of your data and analytics workloads in a single, unified platform.

    Building a Serverless Data Pipeline on AWS: A Step-by-Step Guide

     For data engineers, the dream is to build pipelines that are robust, scalable, and cost-effective. For years, this meant managing complex clusters and servers. But with the power of the cloud, a new paradigm has emerged: the serverless data pipeline on AWS. This approach allows you to process massive amounts of data without managing a single server, paying only for the compute you actually consume.

    Going serverless means you can say goodbye to idle clusters, patching servers, and capacity planning. Instead, you use a suite of powerful AWS services that automatically scale to meet demand. This isn’t just a technical shift; it’s a strategic advantage that allows your team to focus on delivering value from data, not managing infrastructure.

    In this guide, we’ll walk you through the essential components and steps to build a modern, event-driven serverless data pipeline on AWS using S3, Lambda, AWS Glue, and Athena.

    The Architecture: A Four-Part Harmony

    A successful serverless pipeline relies on a few core AWS services working together seamlessly. Each service has a specific role, creating an efficient and automated workflow from raw data ingestion to analytics-ready insights.

    Here’s a high-level look at our architecture:

    1. Amazon S3 (Simple Storage Service): The foundation of our pipeline. S3 acts as a highly durable and scalable data lake where we will store our raw, processed, and curated data in different stages.
    2. AWS Lambda: The trigger and orchestrator. Lambda functions are small, serverless pieces of code that can run in response to events, such as a new file being uploaded to S3.
    3. AWS Glue: The serverless ETL engine. Glue can automatically discover the schema of our data and run powerful Spark jobs to clean, transform, and enrich it, converting it into an optimized format like Parquet.
    4. Amazon Athena: The interactive query service. Athena allows us to run standard SQL queries directly on our processed data stored in S3, making it instantly available for analysis without needing a traditional data warehouse.

    Now, let’s build it step-by-step.

    Step 1: Setting Up the S3 Data Lake Buckets

    First, we need a place to store our data. A best practice is to use separate prefixes or even separate buckets to represent the different stages of your data pipeline, creating a clear and organized data lake.

    For this guide, we’ll use a single bucket with three prefixes:

    • s3://your-data-lake-bucket/raw/: This is where raw, unaltered data lands from your sources.
    • s3://your-data-lake-bucket/processed/: After cleaning and transformation by our Glue job, the data is stored here in an optimized format (e.g., Parquet).
    • s3://your-data-lake-bucket/curated/: (Optional) A final layer for business-level aggregations or specific data marts.

    Step 2: Creating the Lambda Trigger

    Next, we need a mechanism to automatically start our pipeline when new data arrives. AWS Lambda is perfect for this. We will create a Lambda function that “listens” for a file upload event in our raw/ S3 prefix and then starts our AWS Glue ETL job.

    Here is a sample Python code for the Lambda function:

    lambda_function.py

    Python

    import boto3
    import os
    
    def lambda_handler(event, context):
        """
        This Lambda function is triggered by an S3 event and starts an AWS Glue ETL job.
        """
        # Get the Glue job name from environment variables
        glue_job_name = os.environ['GLUE_JOB_NAME']
        
        # Extract the bucket and key from the S3 event
        bucket = event['Records'][0]['s3']['bucket']['name']
        key = event['Records'][0]['s3']['object']['key']
        
        print(f"File uploaded: s3://{bucket}/{key}")
        
        # Initialize the Glue client
        glue_client = boto3.client('glue')
        
        try:
            print(f"Starting Glue job: {glue_job_name}")
            response = glue_client.start_job_run(
                JobName=glue_job_name,
                Arguments={
                    '--S3_SOURCE_PATH': f"s3://{bucket}/{key}"
                }
            )
            print(f"Successfully started Glue job run. Run ID: {response['JobRunId']}")
            return {
                'statusCode': 200,
                'body': f"Started Glue job {glue_job_name} for file s3://{bucket}/{key}"
            }
        except Exception as e:
            print(f"Error starting Glue job: {e}")
            raise e
    
    

    To make this work, you need to:

    1. Create this Lambda function in the AWS console.
    2. Set an environment variable named GLUE_JOB_NAME with the name of the Glue job you’ll create in the next step.
    3. Configure an S3 trigger on the function, pointing it to your s3://your-data-lake-bucket/raw/ prefix for “All object create events.”

    Step 3: Transforming Data with AWS Glue

    AWS Glue is the heavy lifter in our pipeline. It’s a fully managed ETL service that makes it easy to prepare and load your data for analytics. For this step, you would create a Glue ETL job.

    Inside the Glue Studio, you can visually build a job or write a PySpark script. The job will:

    1. Read the raw data (e.g., CSV) from the source path passed by the Lambda function.
    2. Perform transformations, such as changing data types, dropping columns, or joining with other datasets.
    3. Write the transformed data to the processed/ S3 prefix in Apache Parquet format. Parquet is a columnar storage format that is highly optimized for analytical queries.

    Your Glue job will have a simple script that looks something like this:

    Python

    import sys
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.job import Job
    
    # Get job arguments
    args = getResolvedOptions(sys.argv, ['JOB_NAME', 'S3_SOURCE_PATH'])
    
    sc = SparkContext()
    glueContext = GlueContext(sc)
    spark = glueContext.spark_session
    job = Job(glueContext)
    job.init(args['JOB_NAME'], args)
    
    # Read the raw CSV data from S3
    source_dyf = glueContext.create_dynamic_frame.from_options(
        connection_type="s3",
        connection_options={"paths": [args['S3_SOURCE_PATH']]},
        format="csv",
        format_options={"withHeader": True},
    )
    
    # Convert to Parquet and write to the processed location
    glueContext.write_dynamic_frame.from_options(
        frame=source_dyf,
        connection_type="s3",
        connection_options={"path": "s3://your-data-lake-bucket/processed/"},
        format="parquet",
    )
    
    job.commit()
    
    

    Step 4: Querying Processed Data with Amazon Athena

    Once your data is processed and stored as Parquet in S3, it’s ready for analysis. With Amazon Athena, you don’t need to load it into another database. You can query it right where it is.

    1. Create a Database: In the Athena query editor, create a database for your data lake: CREATE DATABASE my_data_lake;
    2. Run a Glue Crawler (or Create a Table): The easiest way to make your data queryable is to run an AWS Glue Crawler on your processed/ S3 prefix. The crawler will automatically detect the schema of your Parquet files and create an Athena table for you.
    3. Query Your Data: Once the table is created, you can run standard SQL queries on it.

    SQL

    SELECT
        customer_id,
        order_status,
        COUNT(order_id) as number_of_orders
    FROM
        my_data_lake.processed_data
    WHERE
        order_date >= '2025-01-01'
    GROUP BY
        1, 2
    ORDER BY
        3 DESC;
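
    If you’d rather not run a crawler (step 2 above), you can also declare the table yourself; a sketch assuming the columns used in the query above:

    SQL

    CREATE EXTERNAL TABLE my_data_lake.processed_data (
        order_id     string,
        customer_id  string,
        order_status string,
        order_date   date
    )
    STORED AS PARQUET
    LOCATION 's3://your-data-lake-bucket/processed/';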
    

    Conclusion: The Power of Serverless

    You have now built a fully automated, event-driven, and serverless data pipeline on AWS. When a new file lands in your raw S3 bucket, a Lambda function triggers a Glue job that processes the data and writes it back to S3 in an optimized format, ready to be queried instantly by Athena.

    This architecture is not only powerful but also incredibly efficient. It scales automatically to handle terabytes of data and ensures you only pay for the resources you use, making it the perfect foundation for a modern data engineering stack.

    Structuring dbt Projects in Snowflake: The Definitive Guide

    If you’ve ever inherited a dbt project, you know there are two kinds: the clean, logical, and easy-to-navigate project, and the other kind—a tangled mess of models that makes you question every life choice that led you to that moment. The difference between the two isn’t talent; it’s structure. For high-performing data teams, a well-defined structure for dbt projects in Snowflake isn’t just a nice-to-have, it’s the very foundation of a scalable, maintainable, and trustworthy analytics workflow.

    While dbt and Snowflake are a technical match made in heaven, simply putting them together doesn’t guarantee success. Without a clear and consistent project structure, even the most powerful tools can lead to chaos. Dependencies become circular, model names become ambiguous, and new team members spend weeks just trying to understand the data flow.

    This guide provides a battle-tested blueprint for structuring dbt projects in Snowflake. We’ll move beyond the basics and dive into a scalable, multi-layered framework that will save you and your team countless hours of rework and debugging.

    Why dbt and Snowflake Are a Perfect Match

    Before we dive into project structure, it’s crucial to understand why this combination has become the gold standard for the modern data stack. Their synergy comes from a shared philosophy of decoupling, scalability, and performance.

    • Snowflake’s Decoupled Architecture: Its separation of storage and compute is revolutionary. This means you can run massive dbt transformations using a dedicated, powerful virtual warehouse without slowing down your BI tools.
    • dbt’s Transformation Power: dbt focuses on the “T” in ELT—transformation. It allows you to build, test, and document your data models using simple SQL, which it then compiles and runs directly inside Snowflake’s powerful engine.
    • Cost and Performance Synergy: Running dbt models in Snowflake is incredibly efficient. You can spin up a warehouse for a dbt run and spin it down the second it’s finished, meaning you only pay for the exact compute you use.
    • Zero-Copy Cloning for Development: Instantly create a zero-copy clone of your entire production database for development. This allows you to test your dbt project against production-scale data without duplicating storage or impacting the production environment.
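
    As a quick illustration of that last point, cloning production for development is a single statement. A minimal sketch, assuming hypothetical database names:

    SQL

    -- Create a zero-copy clone of production for development work
    -- (illustrative names; no data is physically copied when the clone is created)
    CREATE DATABASE analytics_dev CLONE analytics_prod;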

    In short, Snowflake provides the powerful, elastic engine, while dbt provides the organized, version-controlled, and testable framework to harness that engine.

    The Layered Approach: From Raw Data to Actionable Insights

    A scalable dbt project is like a well-organized factory. Raw materials come in one end, go through a series of refined production stages, and emerge as a finished product. We achieve this by structuring our models into distinct layers, each with a specific job.

    Our structure will follow this flow: Sources -> Staging -> Intermediate -> Marts.
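
    Inside the dbt project, these layers typically map to folders under models/. A minimal layout sketch, using the model and source files from the examples that follow:

    models/
      staging/
        sources.yml
        stg_customers.sql
        stg_orders.sql
        stg_payments.sql
      intermediate/
        int_orders_with_payments.sql
      marts/
        fct_customer_orders.sql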

    Layer 1: Declaring Your Sources (The Contract with Raw Data)

    Before you write a single line of transformation SQL, you must tell dbt where your raw data lives in Snowflake. This is done in a .yml file. Think of this file as a formal contract that declares your raw tables, allows you to add data quality tests, and serves as a foundation for your data lineage graph.

    Example: models/staging/sources.yml

    Let’s assume we have a RAW_DATA database in Snowflake with schemas from a jaffle_shop and stripe.

    YAML

    version: 2
    
    sources:
      - name: jaffle_shop
        database: raw_data 
        schema: jaffle_shop
        description: "Raw data from the primary application database."
        tables:
          - name: customers
            columns:
              - name: id
                tests:
                  - unique
                  - not_null
          - name: orders
            loaded_at_field: _etl_loaded_at
            freshness:
              warn_after: {count: 12, period: hour}
    
      - name: stripe
        database: raw_data
        schema: stripe
        tables:
          - name: payment
            columns:
              - name: orderid
                tests:
                  - relationships:
                      to: source('jaffle_shop', 'orders')
                      field: id
    

    Layer 2: Staging Models (Clean and Standardize)

    Staging models are the first line of transformation. They should have a 1:1 relationship with your source tables. The goal here is strict and simple:

    • DO: Rename columns, cast data types, and perform very light cleaning.
    • DO NOT: Join to other tables.

    This creates a clean, standardized version of each source table, forming a reliable foundation for the rest of your project.

    Example: models/staging/stg_customers.sql

    SQL

    -- models/staging/stg_customers.sql
    with source as (
        select * from {{ source('jaffle_shop', 'customers') }}
    ),
    
    renamed as (
        select
            id as customer_id,
            first_name,
            last_name
        from source
    )
    
    select * from renamed
    

    Layer 3: Intermediate Models (Build, Join, and Aggregate)

    This is where the real business logic begins. Intermediate models are the “workhorses” of your dbt project. They take the clean data from your staging models and start combining them.

    • DO: Join different staging models together.
    • DO: Perform complex calculations, aggregations, and business-specific logic.
    • DO: Materialize them as tables if they are slow to run or used by many downstream models (see the config() line at the top of the example below).

    These models are not typically exposed to business users. They are building blocks for your final data marts.

    Example: models/intermediate/int_orders_with_payments.sql

    SQL

    -- models/intermediate/int_orders_with_payments.sql
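    -- Optional: materialize this model as a table when it is slow to build or
    -- heavily reused downstream (without this line, dbt falls back to your project default)
    {{ config(materialized='table') }}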
    with orders as (
        select * from {{ ref('stg_orders') }}
    ),
    
    payments as (
        select * from {{ ref('stg_payments') }}
    ),
    
    order_payments as (
        select
            order_id,
            sum(case when payment_status = 'success' then amount else 0 end) as total_amount
        from payments
        group by 1
    ),
    
    final as (
        select
            orders.order_id,
            orders.customer_id,
            orders.order_date,
            coalesce(order_payments.total_amount, 0) as amount
        from orders
        left join order_payments 
          on orders.order_id = order_payments.order_id
    )
    
    select * from final
    

    Layer 4: Data Marts (Ready for Analysis)

    Finally, we arrive at the data marts. These are the polished, final models that power your dashboards, reports, and analytics. They should be clean, easy to understand, and built for a specific business purpose (e.g., finance, marketing, product).

    • DO: Join intermediate models.
    • DO: Have clear, business-friendly column names.
    • DO NOT: Contain complex, nested logic. All the heavy lifting should have been done in the intermediate layer.

    These models are the “products” of your data factory, ready for consumption by BI tools like Tableau, Looker, or Power BI.

    Example: models/marts/fct_customer_orders.sql

    SQL

    -- models/marts/fct_customer_orders.sql
    with customers as (
        select * from {{ ref('stg_customers') }}
    ),
    
    orders as (
        select * from {{ ref('int_orders_with_payments') }}
    ),
    
    customer_orders as (
        select
            customers.customer_id,
            min(orders.order_date) as first_order_date,
            max(orders.order_date) as most_recent_order_date,
            count(orders.order_id) as number_of_orders,
            sum(orders.amount) as lifetime_value
        from customers
        left join orders 
          on customers.customer_id = orders.customer_id
        group by 1
    )
    
    select * from customer_orders
    

    Conclusion: Structure is Freedom

    By adopting a layered approach to your dbt projects in Snowflake, you move from a chaotic, hard-to-maintain process to a scalable, modular, and efficient analytics factory. This structure gives you:

    • Maintainability: When logic needs to change, you know exactly which model to edit.
    • Scalability: Onboarding new data sources or team members becomes a clear, repeatable process.
    • Trust: With testing at every layer, you build confidence in your data and empower the entire organization to make better, faster decisions.
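
    To make that last point concrete, a test can be as lightweight as the unique and not_null checks in the sources file above, or a singular test: a SQL file in your tests/ folder that returns rows only when something is wrong. A hypothetical example against the mart we just built:

    SQL

    -- tests/assert_no_negative_lifetime_value.sql
    -- The test fails if any customer has a negative lifetime value
    -- (hypothetical singular test; the file name is illustrative)
    select
        customer_id,
        lifetime_value
    from {{ ref('fct_customer_orders') }}
    where lifetime_value < 0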

    This framework isn’t just about writing cleaner code—it’s about building a foundation for a mature and reliable data culture.

  • Snowflake Architecture Explained: A Simple Breakdown

    Snowflake Architecture Explained: A Simple Breakdown

    In the world of data, Snowflake’s rapid rise to a leader in the cloud data space is a well-known story. But what is the secret behind its success? The answer isn’t a list of features; it’s the Snowflake architecture itself. This unique three-layer design makes it fundamentally different from traditional data warehouses and is the key to its performance and scalability. This post takes you beyond the marketing buzz and deconstructs those core layers, because they are the secret sauce that makes everything else possible, from infinite scaling to zero-copy cloning.

    The Flaws of Traditional Data Warehouse Architecture

    Before diving into Snowflake, let’s first revisit the pain points of traditional on-premise data warehouses. Historically, engineers built these systems on one of two architectures:

    1. Shared-Disk: In this model, multiple compute nodes (CPUs) all access the same central storage disk, which leads to a bottleneck at the disk level.
    2. Shared-Nothing: Here, each compute node has its own dedicated storage. To work on a large dataset, the system must shuffle data across the network between nodes, creating significant network congestion.

    In both cases, you ran into the same fundamental problem: contention. Data loading jobs slowed down analytics, complex queries stalled the system for everyone, and scaling became an expensive, all-or-nothing nightmare.

    Snowflake’s Three-Layer Architecture: A Masterclass in Decoupling

    Fortunately, Snowflake’s founders saw this core problem and solved it with a unique, patented, multi-cluster, shared-data architecture they built specifically for the cloud. You can best understand this architecture as three distinct, independently scalable layers.

    Here’s a visual representation of how these layers interact:

    [Diagram of the three-layer Snowflake architecture, showing the decoupled storage, multi-cluster compute, and cloud services layers.]

    Layer 1: The Centralized Storage Foundation

    At its base, Snowflake separates storage from everything else. All your data resides in a single, centralized repository built on cloud object storage such as Amazon S3, Azure Blob Storage, or Google Cloud Storage.

    • Columnar format: Data is stored in compressed, columnar micro-partitions (roughly 50–500 MB of uncompressed data each).
    • Immutable micro-partitions: Each micro-partition carries metadata (e.g., min/max values per column) that Snowflake uses to prune partitions at query time.
    • Self-optimizing: Snowflake automatically handles compression and micro-partition organization, so there are no indexes or manual partitions to create or tune.

    Key Benefit: Users don’t manage storage directly; Snowflake handles organization, compression, and optimization.

    Layer 2: The Decoupled Compute Architecture

    This is where the real magic of the Snowflake architecture shines. The compute layer consists of independent clusters of compute resources called Virtual Warehouses. Because compute is decoupled from storage, each workload (ETL, BI, data science) can run on its own dedicated warehouse, which completely eliminates resource contention.

    • Concurrency & Isolation: Multiple warehouses can access the same data without contention.
    • Auto-scaling: Warehouses can scale up/down based on workload.
    • Workload separation: You can assign different warehouses to different teams or tasks (e.g., ETL vs. BI).

    Key Benefit: Compute resources are decoupled from storage, allowing flexible scaling and workload isolation.
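
    A minimal sketch of what that looks like in practice: a dedicated ETL warehouse that suspends itself when idle. The name and size are placeholders, and the multi-cluster settings assume Enterprise edition or higher.

    SQL

    -- Dedicated warehouse for ETL workloads; suspends after 60 seconds of inactivity
    CREATE WAREHOUSE IF NOT EXISTS etl_wh
      WAREHOUSE_SIZE = 'LARGE'
      AUTO_SUSPEND = 60
      AUTO_RESUME = TRUE
      MIN_CLUSTER_COUNT = 1
      MAX_CLUSTER_COUNT = 3;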

    Layer 3: The Cloud Services Layer as the Architecture’s Brain

    Finally, the services layer acts as the central nervous system of Snowflake, orchestrating everything. It handles query optimization, security, metadata management, and transaction consistency, and it enables powerful features like Zero-Copy Cloning, Time Travel, and Secure Data Sharing.

    • Authentication & access control: Role-based access, encryption, and security policies.
    • Query optimization: Parses, plans, and optimizes SQL queries.
    • Infrastructure management: Handles provisioning, monitoring, and failover.

    Key Benefit: This layer orchestrates the entire platform, ensuring seamless user experience and system reliability.
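
    Two of those features are easy to see in action. A brief sketch with an illustrative table name:

    SQL

    -- Time Travel: query the table as it looked 30 minutes ago
    SELECT * FROM orders AT (OFFSET => -60*30);

    -- Zero-Copy Cloning: an instant copy that shares storage with the original
    CREATE TABLE orders_backup CLONE orders;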

    Conclusion: Why the Snowflake Architecture is a Game-Changer

    In conclusion, Snowflake’s success is not an accident; it’s the direct result of a revolutionary architecture that elegantly solves the core challenges that plagued data analytics for decades. By decoupling storage, compute, and services, the Snowflake architecture delivers unparalleled:

    • Performance: Queries run fast without interruption.
    • Concurrency: All users and processes can work simultaneously.
    • Simplicity: The platform manages the complexity for you.
    • Cost-Effectiveness: You only pay for what you use.

    Ultimately, it’s not just an evolution; it’s a redefinition of what a data platform can be.