    How to Build a Data Lakehouse on Azure

     For years, data teams have faced a difficult choice: the structured, high-performance world of the data warehouse, or the flexible, low-cost scalability of the data lake. But what if you could have the best of both worlds? Enter the Data Lakehouse, an architectural pattern that combines the reliability and performance of a warehouse with the openness and flexibility of a data lake. And when it comes to implementation, building a data lakehouse on Azure has become the go-to strategy for future-focused data teams.

    The traditional data lake, while great for storing vast amounts of raw data, often turned into a “data swamp”—unreliable and difficult to manage. The data warehouse, on the other hand, struggled with unstructured data and could become rigid and expensive. The Lakehouse architecture solves this dilemma.

    In this guide, we’ll walk you through the blueprint for building a powerful and modern data lakehouse on Azure, leveraging a trio of best-in-class services: Azure Data Lake Storage (ADLS) Gen2, Azure Databricks, and Power BI.

    The Azure Lakehouse Architecture: A Powerful Trio

    A successful Lakehouse implementation relies on a few core services working in perfect harmony. This architecture is designed to handle everything from raw data ingestion and large-scale ETL to interactive analytics and machine learning.

    Here’s the high-level architecture we will build:

    1. Azure Data Lake Storage (ADLS) Gen2: This is the foundation. ADLS Gen2 is a highly scalable and cost-effective cloud storage solution that combines the best of a file system with massive scale, making it the perfect storage layer for our Lakehouse.

    2. Azure Databricks: This is the unified analytics engine. Databricks provides a collaborative environment for data engineers and data scientists to run large-scale data processing (ETL/ELT) with Spark, build machine learning models, and manage the entire data lifecycle.

    3. Delta Lake: The transactional storage layer. Built on top of ADLS, Delta Lake is an open-source technology (natively integrated into Databricks) that brings ACID transactions, data reliability, and high performance to your data lake, effectively turning it into a Lakehouse.

    4. Power BI: The visualization and reporting layer. Power BI integrates seamlessly with Azure Databricks, allowing business users to run interactive queries and build insightful dashboards directly on the data in the Lakehouse.

    Let’s explore each component.

    Step 1: The Foundation – Azure Data Lake Storage (ADLS) Gen2

    Every great data platform starts with a solid storage foundation. For a Lakehouse on Azure, ADLS Gen2 is the undisputed choice. Unlike standard object storage, it includes a hierarchical namespace, which allows you to organize your data into directories and folders just like a traditional file system. This is critical for performance and organization in large-scale analytics.

    A best practice is to structure your data lake using a multi-layered approach, often called “medallion architecture”:

    • Bronze Layer (/bronze): Raw, untouched data ingested from various source systems.

    • Silver Layer (/silver): Cleaned, filtered, and standardized data. This is where data quality rules are applied.

    • Gold Layer (/gold): Highly aggregated, business-ready data that is optimized for analytics and reporting.
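
    To make these layers concrete, here’s a minimal sketch of how the three paths might look in a Databricks notebook, assuming a hypothetical storage account (datalakeacct) and container (lakehouse); your names and folder layout will differ:

    # Hypothetical ADLS Gen2 locations for the medallion layers
    # (storage account and container names are placeholders)
    storage_account = "datalakeacct"
    container = "lakehouse"

    bronze_path = f"abfss://{container}@{storage_account}.dfs.core.windows.net/bronze"
    silver_path = f"abfss://{container}@{storage_account}.dfs.core.windows.net/silver"
    gold_path = f"abfss://{container}@{storage_account}.dfs.core.windows.net/gold"

    # Once data starts landing, you can sanity-check the layout from the notebook
    display(dbutils.fs.ls(bronze_path))

    The ETL snippet later in this post reads from an equivalent /mnt/datalake mount point; either style of path works with Spark on Databricks.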

    Step 2: The Engine – Azure Databricks

    With our storage in place, we need a powerful engine to process the data. Azure Databricks is a first-class service on Azure that provides a managed, high-performance Apache Spark environment.

    Data engineers use Databricks notebooks to:

    • Ingest raw data from the Bronze layer.

    • Perform large-scale transformations, cleaning, and enrichment using Spark.

    • Write the processed data to the Silver and Gold layers.

    Here’s a simple PySpark code snippet you might run in a Databricks notebook to process raw CSV files into a cleaned-up table:

    # Databricks notebook code snippet
    from pyspark.sql.functions import col, to_date

    # Define paths for our data layers
    bronze_path = "/mnt/datalake/bronze/raw_orders.csv"
    silver_path = "/mnt/datalake/silver/cleaned_orders"

    # Read raw data from the Bronze layer using Spark
    df_bronze = spark.read.format("csv") \
        .option("header", "true") \
        .option("inferSchema", "true") \
        .load(bronze_path)

    # Perform basic transformations
    df_silver = df_bronze.select(
        col("OrderID").alias("order_id"),
        col("CustomerID").alias("customer_id"),
        to_date(col("OrderDate"), "MM/dd/yyyy").alias("order_date"),
        col("Amount").cast("decimal(18, 2)").alias("order_amount")
    ).where(col("Amount").isNotNull())

    # Write the cleaned data to the Silver layer in Delta format
    df_silver.write.format("delta").mode("overwrite").save(silver_path)

    print("Successfully processed raw orders into the Silver layer.")

    Step 3: The Magic – Delta Lake

    Notice the .format("delta") in the code above? That’s the secret sauce. Delta Lake is an open-source storage layer that runs on top of your existing data lake (ADLS) and brings warehouse-like capabilities.

    Key features Delta Lake provides:

    • ACID Transactions: Ensures that your data operations either complete fully or not at all, preventing data corruption.

    • Time Travel (Data Versioning): Allows you to query previous versions of your data, making it easy to audit changes or roll back errors.

    • Schema Enforcement & Evolution: Prevents bad data from corrupting your tables by enforcing a schema, while still allowing you to gracefully evolve it over time.

    • Performance Optimization: Features like data skipping and Z-ordering dramatically speed up queries.

    By writing our data in the Delta format, we’ve transformed our simple cloud storage into a reliable, high-performance Lakehouse.
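
    For example, here’s a hedged sketch of what this looks like in a Databricks notebook, reusing the silver_path from the earlier snippet; the available versions and history depend entirely on your own writes:

    from delta.tables import DeltaTable

    # Inspect the transaction log: every write to the Silver table creates a new version
    silver_table = DeltaTable.forPath(spark, silver_path)
    display(silver_table.history())

    # Time travel: read the table as it looked at an earlier version
    df_v0 = spark.read.format("delta").option("versionAsOf", 0).load(silver_path)
    print(f"Rows in version 0: {df_v0.count()}")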

    Step 4: The Payoff – Visualization with Power BI

    With our data cleaned and stored in the Gold layer of our Lakehouse, the final step is to make it accessible to business users. Power BI has a native, high-performance connector for Azure Databricks.

    You can connect Power BI directly to your Databricks cluster and query the Gold tables. This allows you to:

    • Build interactive dashboards and reports.

    • Leverage Power BI’s powerful analytics and visualization capabilities.

    • Ensure that everyone in the organization is making decisions based on the same, single source of truth from the Lakehouse.
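
    To show what that Gold layer might contain, here’s a minimal sketch, assuming the silver_path from earlier and a hypothetical gold schema, that aggregates the Silver orders into a business-ready Delta table and registers it in the metastore so Power BI can query it by name:

    from pyspark.sql.functions import count, sum as _sum

    # Build a business-ready aggregate from the Silver layer
    df_silver = spark.read.format("delta").load(silver_path)
    df_gold = (
        df_silver.groupBy("customer_id")
        .agg(
            count("order_id").alias("number_of_orders"),
            _sum("order_amount").alias("total_spend"),
        )
    )

    # Register the result as a managed Delta table so BI tools can query it by name
    spark.sql("CREATE SCHEMA IF NOT EXISTS gold")
    df_gold.write.format("delta").mode("overwrite").saveAsTable("gold.customer_order_summary")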

    Conclusion: The Best of Both Worlds on Azure

    By combining the low-cost, scalable storage of Azure Data Lake Storage Gen2 with the powerful processing engine of Azure Databricks and the reliability of Delta Lake, you can build a truly modern data lakehouse on Azure. This architecture eliminates the need to choose between a data lake and a data warehouse, giving you the flexibility, performance, and reliability needed to support all of your data and analytics workloads in a single, unified platform.

    Building a Serverless Data Pipeline on AWS: A Step-by-Step Guide

     For data engineers, the dream is to build pipelines that are robust, scalable, and cost-effective. For years, this meant managing complex clusters and servers. But with the power of the cloud, a new paradigm has emerged: the serverless data pipeline on AWS. This approach allows you to process massive amounts of data without managing a single server, paying only for the compute you actually consume.

    Going serverless means you can say goodbye to idle clusters, patching servers, and capacity planning. Instead, you use a suite of powerful AWS services that automatically scale to meet demand. This isn’t just a technical shift; it’s a strategic advantage that allows your team to focus on delivering value from data, not managing infrastructure.

    In this guide, we’ll walk you through the essential components and steps to build a modern, event-driven serverless data pipeline on AWS using S3, Lambda, AWS Glue, and Athena.

    The Architecture: A Four-Part Harmony

    A successful serverless pipeline relies on a few core AWS services working together seamlessly. Each service has a specific role, creating an efficient and automated workflow from raw data ingestion to analytics-ready insights.

    Here’s a high-level look at our architecture:

    1. Amazon S3 (Simple Storage Service): The foundation of our pipeline. S3 acts as a highly durable and scalable data lake where we will store our raw, processed, and curated data in different stages.
    2. AWS Lambda: The trigger and orchestrator. Lambda functions are small, serverless pieces of code that can run in response to events, such as a new file being uploaded to S3.
    3. AWS Glue: The serverless ETL engine. Glue can automatically discover the schema of our data and run powerful Spark jobs to clean, transform, and enrich it, converting it into an optimized format like Parquet.
    4. Amazon Athena: The interactive query service. Athena allows us to run standard SQL queries directly on our processed data stored in S3, making it instantly available for analysis without needing a traditional data warehouse.

    Now, let’s build it step-by-step.

    Step 1: Setting Up the S3 Data Lake Buckets

    First, we need a place to store our data. A best practice is to use separate prefixes or even separate buckets to represent the different stages of your data pipeline, creating a clear and organized data lake.

    For this guide, we’ll use a single bucket with three prefixes:

    • s3://your-data-lake-bucket/raw/: This is where raw, unaltered data lands from your sources.
    • s3://your-data-lake-bucket/processed/: After cleaning and transformation by our Glue job, the data is stored here in an optimized format (e.g., Parquet).
    • s3://your-data-lake-bucket/curated/: (Optional) A final layer for business-level aggregations or specific data marts.
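
    If you prefer to script this setup rather than click through the console, a minimal boto3 sketch might look like the following; the bucket name and region are placeholders, and the empty “folder” markers are optional since S3 prefixes are created implicitly when objects are written:

    Python

    import boto3

    s3 = boto3.client("s3", region_name="us-east-1")
    bucket_name = "your-data-lake-bucket"  # placeholder; bucket names must be globally unique

    # Create the data lake bucket (outside us-east-1, also pass a CreateBucketConfiguration)
    s3.create_bucket(Bucket=bucket_name)

    # Optionally create empty "folder" markers for the three pipeline stages
    for prefix in ["raw/", "processed/", "curated/"]:
        s3.put_object(Bucket=bucket_name, Key=prefix)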

    Step 2: Creating the Lambda Trigger

    Next, we need a mechanism to automatically start our pipeline when new data arrives. AWS Lambda is perfect for this. We will create a Lambda function that “listens” for a file upload event in our raw/ S3 prefix and then starts our AWS Glue ETL job.

    Here is a sample Python code for the Lambda function:

    lambda_function.py

    Python

    import boto3
    import os
    
    def lambda_handler(event, context):
        """
        This Lambda function is triggered by an S3 event and starts an AWS Glue ETL job.
        """
        # Get the Glue job name from environment variables
        glue_job_name = os.environ['GLUE_JOB_NAME']
        
        # Extract the bucket and key from the S3 event
        bucket = event['Records'][0]['s3']['bucket']['name']
        key = event['Records'][0]['s3']['object']['key']
        
        print(f"File uploaded: s3://{bucket}/{key}")
        
        # Initialize the Glue client
        glue_client = boto3.client('glue')
        
        try:
            print(f"Starting Glue job: {glue_job_name}")
            response = glue_client.start_job_run(
                JobName=glue_job_name,
                Arguments={
                    '--S3_SOURCE_PATH': f"s3://{bucket}/{key}"
                }
            )
            print(f"Successfully started Glue job run. Run ID: {response['JobRunId']}")
            return {
                'statusCode': 200,
                'body': f"Started Glue job {glue_job_name} for file s3://{bucket}/{key}"
            }
        except Exception as e:
            print(f"Error starting Glue job: {e}")
            raise e
    
    

    To make this work, you need to:

    1. Create this Lambda function in the AWS console.
    2. Set an environment variable named GLUE_JOB_NAME with the name of the Glue job you’ll create in the next step.
    3. Configure an S3 trigger on the function, pointing it to your s3://your-data-lake-bucket/raw/ prefix for “All object create events.”
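
    The trigger in step 3 can also be wired up programmatically. Here’s a hedged boto3 sketch; the account ID, function ARN, and bucket name are placeholders, and your IAM setup may require additional permissions:

    Python

    import boto3

    bucket_name = "your-data-lake-bucket"  # placeholder
    function_arn = (
        "arn:aws:lambda:us-east-1:123456789012:function:start-glue-job"  # placeholder
    )

    # Allow S3 to invoke the Lambda function
    lambda_client = boto3.client("lambda")
    lambda_client.add_permission(
        FunctionName=function_arn,
        StatementId="AllowS3InvokeFromDataLake",
        Action="lambda:InvokeFunction",
        Principal="s3.amazonaws.com",
        SourceArn=f"arn:aws:s3:::{bucket_name}",
    )

    # Route "object created" events under the raw/ prefix to the Lambda function
    s3 = boto3.client("s3")
    s3.put_bucket_notification_configuration(
        Bucket=bucket_name,
        NotificationConfiguration={
            "LambdaFunctionConfigurations": [
                {
                    "LambdaFunctionArn": function_arn,
                    "Events": ["s3:ObjectCreated:*"],
                    "Filter": {
                        "Key": {"FilterRules": [{"Name": "prefix", "Value": "raw/"}]}
                    },
                }
            ]
        },
    )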

    Step 3: Transforming Data with AWS Glue

    AWS Glue is the heavy lifter in our pipeline. It’s a fully managed ETL service that makes it easy to prepare and load your data for analytics. For this step, you would create a Glue ETL job.

    Inside the Glue Studio, you can visually build a job or write a PySpark script. The job will:

    1. Read the raw data (e.g., CSV) from the source path passed by the Lambda function.
    2. Perform transformations, such as changing data types, dropping columns, or joining with other datasets.
    3. Write the transformed data to the processed/ S3 prefix in Apache Parquet format. Parquet is a columnar storage format that is highly optimized for analytical queries.

    Your Glue job will have a simple script that looks something like this:

    Python

    import sys
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.job import Job
    
    # Get job arguments
    args = getResolvedOptions(sys.argv, ['JOB_NAME', 'S3_SOURCE_PATH'])
    
    sc = SparkContext()
    glueContext = GlueContext(sc)
    spark = glueContext.spark_session
    job = Job(glueContext)
    job.init(args['JOB_NAME'], args)
    
    # Read the raw CSV data from S3
    source_dyf = glueContext.create_dynamic_frame.from_options(
        connection_type="s3",
        connection_options={"paths": [args['S3_SOURCE_PATH']]},
        format="csv",
        format_options={"withHeader": True},
    )
    
    # Convert to Parquet and write to the processed location
    glueContext.write_dynamic_frame.from_options(
        frame=source_dyf,
        connection_type="s3",
        connection_options={"path": "s3://your-data-lake-bucket/processed/"},
        format="parquet",
    )
    
    job.commit()
    
    

    Step 4: Querying Processed Data with Amazon Athena

    Once your data is processed and stored as Parquet in S3, it’s ready for analysis. With Amazon Athena, you don’t need to load it into another database. You can query it right where it is.

    1. Create a Database: In the Athena query editor, create a database for your data lake: CREATE DATABASE my_data_lake;
    2. Run a Glue Crawler (or Create a Table): The easiest way to make your data queryable is to run an AWS Glue Crawler on your processed/ S3 prefix. The crawler will automatically detect the schema of your Parquet files and create an Athena table for you.
    3. Query Your Data: Once the table is created, you can run standard SQL queries on it.

    SQL

    SELECT
        customer_id,
        order_status,
        COUNT(order_id) as number_of_orders
    FROM
        my_data_lake.processed_data
    WHERE
        order_date >= '2025-01-01'
    GROUP BY
        1, 2
    ORDER BY
        3 DESC;
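
    If you’d rather run the same query programmatically, for example from an orchestration script, here’s a hedged boto3 sketch; the database, table, and output location are placeholders matching the names used above:

    Python

    import time
    import boto3

    athena = boto3.client("athena")

    query = """
        SELECT customer_id, order_status, COUNT(order_id) AS number_of_orders
        FROM my_data_lake.processed_data
        WHERE order_date >= '2025-01-01'
        GROUP BY 1, 2
        ORDER BY 3 DESC
    """

    # Kick off the query; Athena writes result files to the S3 location you choose
    response = athena.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": "my_data_lake"},
        ResultConfiguration={"OutputLocation": "s3://your-data-lake-bucket/athena-results/"},
    )
    execution_id = response["QueryExecutionId"]

    # Poll until the query finishes, then fetch the first page of results
    state = "RUNNING"
    while state in ("QUEUED", "RUNNING"):
        time.sleep(2)
        state = athena.get_query_execution(QueryExecutionId=execution_id)[
            "QueryExecution"]["Status"]["State"]

    if state == "SUCCEEDED":
        rows = athena.get_query_results(QueryExecutionId=execution_id)["ResultSet"]["Rows"]
        print(rows[:5])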
    

    Conclusion: The Power of Serverless

    You have now built a fully automated, event-driven, and serverless data pipeline on AWS. When a new file lands in your raw S3 bucket, a Lambda function triggers a Glue job that processes the data and writes it back to S3 in an optimized format, ready to be queried instantly by Athena.

    This architecture is not only powerful but also incredibly efficient. It scales automatically to handle terabytes of data and ensures you only pay for the resources you use, making it the perfect foundation for a modern data engineering stack.

    Structuring dbt Projects in Snowflake: The Definitive Guide

    If you’ve ever inherited a dbt project, you know there are two kinds: the clean, logical, and easy-to-navigate project, and the other kind—a tangled mess of models that makes you question every life choice that led you to that moment. The difference between the two isn’t talent; it’s structure. For high-performing data teams, a well-defined structure for dbt projects in Snowflake isn’t just a nice-to-have; it’s the very foundation of a scalable, maintainable, and trustworthy analytics workflow.

    While dbt and Snowflake are a technical match made in heaven, simply putting them together doesn’t guarantee success. Without a clear and consistent project structure, even the most powerful tools can lead to chaos. Dependencies become circular, model names become ambiguous, and new team members spend weeks just trying to understand the data flow.

    This guide provides a battle-tested blueprint for structuring dbt projects in Snowflake. We’ll move beyond the basics and dive into a scalable, multi-layered framework that will save you and your team countless hours of rework and debugging.

    Why dbt and Snowflake Are a Perfect Match

    Before we dive into project structure, it’s crucial to understand why this combination has become the gold standard for the modern data stack. Their synergy comes from a shared philosophy of decoupling, scalability, and performance.

    • Snowflake’s Decoupled Architecture: Its separation of storage and compute is revolutionary. This means you can run massive dbt transformations using a dedicated, powerful virtual warehouse without slowing down your BI tools.
    • dbt’s Transformation Power: dbt focuses on the “T” in ELT—transformation. It allows you to build, test, and document your data models using simple SQL, which it then compiles and runs directly inside Snowflake’s powerful engine.
    • Cost and Performance Synergy: Running dbt models in Snowflake is incredibly efficient. You can spin up a warehouse for a dbt run and spin it down the second it’s finished, meaning you only pay for the exact compute you use.
    • Zero-Copy Cloning for Development: Instantly create a zero-copy clone of your entire production database for development. This allows you to test your dbt project against production-scale data without incurring storage costs or impacting the production environment.
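
    As a quick illustration of that last point, here’s a minimal sketch of creating a development clone with the Snowflake Python connector. The connection parameters and database names are placeholders, and many teams run the same statement from a Snowflake worksheet or a CI job instead:

    Python

    import snowflake.connector

    # Placeholder credentials -- in practice, load these from a secrets manager
    conn = snowflake.connector.connect(
        account="your_account_identifier",
        user="your_user",
        password="your_password",
        role="SYSADMIN",
    )
    cur = conn.cursor()

    # Zero-copy clone: created instantly, and consumes no extra storage
    # until the cloned data starts to diverge from production
    cur.execute("CREATE DATABASE analytics_dev CLONE analytics")

    cur.close()
    conn.close()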

    In short, Snowflake provides the powerful, elastic engine, while dbt provides the organized, version-controlled, and testable framework to harness that engine.

    The Layered Approach: From Raw Data to Actionable Insights

    A scalable dbt project is like a well-organized factory. Raw materials come in one end, go through a series of refined production stages, and emerge as a finished product. We achieve this by structuring our models into distinct layers, each with a specific job.

    Our structure will follow this flow: Sources -> Staging -> Intermediate -> Marts.

    Layer 1: Declaring Your Sources (The Contract with Raw Data)

    Before you write a single line of transformation SQL, you must tell dbt where your raw data lives in Snowflake. This is done in a .yml file. Think of this file as a formal contract that declares your raw tables, allows you to add data quality tests, and serves as a foundation for your data lineage graph.

    Example: models/staging/sources.yml

    Let’s assume we have a RAW_DATA database in Snowflake with schemas from a jaffle_shop and stripe.

    YAML

    version: 2
    
    sources:
      - name: jaffle_shop
        database: raw_data 
        schema: jaffle_shop
        description: "Raw data from the primary application database."
        tables:
          - name: customers
            columns:
              - name: id
                tests:
                  - unique
                  - not_null
          - name: orders
            loaded_at_field: _etl_loaded_at
            freshness:
              warn_after: {count: 12, period: hour}
    
      - name: stripe
        database: raw_data
        schema: stripe
        tables:
          - name: payment
            columns:
              - name: orderid
                tests:
                  - relationships:
                      to: source('jaffle_shop', 'orders')
                      field: id
    

    Layer 2: Staging Models (Clean and Standardize)

    Staging models are the first line of transformation. They should have a 1:1 relationship with your source tables. The goal here is strict and simple:

    • DO: Rename columns, cast data types, and perform very light cleaning.
    • DO NOT: Join to other tables.

    This creates a clean, standardized version of each source table, forming a reliable foundation for the rest of your project.

    Example: models/staging/stg_customers.sql

    SQL

    -- models/staging/stg_customers.sql
    with source as (
        select * from {{ source('jaffle_shop', 'customers') }}
    ),
    
    renamed as (
        select
            id as customer_id,
            first_name,
            last_name
        from source
    )
    
    select * from renamed
    

    Layer 3: Intermediate Models (Build, Join, and Aggregate)

    This is where the real business logic begins. Intermediate models are the “workhorses” of your dbt project. They take the clean data from your staging models and start combining them.

    • DO: Join different staging models together.
    • DO: Perform complex calculations, aggregations, and business-specific logic.
    • Materialize them as tables if they are slow to run or used by many downstream models.

    These models are not typically exposed to business users. They are building blocks for your final data marts.

    Example: models/intermediate/int_orders_with_payments.sql

    SQL

    -- models/intermediate/int_orders_with_payments.sql
    with orders as (
        select * from {{ ref('stg_orders') }}
    ),
    
    payments as (
        select * from {{ ref('stg_payments') }}
    ),
    
    order_payments as (
        select
            order_id,
            sum(case when payment_status = 'success' then amount else 0 end) as total_amount
        from payments
        group by 1
    ),
    
    final as (
        select
            orders.order_id,
            orders.customer_id,
            orders.order_date,
            coalesce(order_payments.total_amount, 0) as amount
        from orders
        left join order_payments 
          on orders.order_id = order_payments.order_id
    )
    
    select * from final
    

    Layer 4: Data Marts (Ready for Analysis)

    Finally, we arrive at the data marts. These are the polished, final models that power your dashboards, reports, and analytics. They should be clean, easy to understand, and built for a specific business purpose (e.g., finance, marketing, product).

    • DO: Join intermediate models.
    • DO: Have clear, business-friendly column names.
    • DO NOT: Contain complex, nested logic. All the heavy lifting should have been done in the intermediate layer.

    These models are the “products” of your data factory, ready for consumption by BI tools like Tableau, Looker, or Power BI.

    Example: models/marts/fct_customer_orders.sql

    SQL

    -- models/marts/fct_customer_orders.sql
    with customers as (
        select * from {{ ref('stg_customers') }}
    ),
    
    orders as (
        select * from {{ ref('int_orders_with_payments') }}
    ),
    
    customer_orders as (
        select
            customers.customer_id,
            min(orders.order_date) as first_order_date,
            max(orders.order_date) as most_recent_order_date,
            count(orders.order_id) as number_of_orders,
            sum(orders.amount) as lifetime_value
        from customers
        left join orders 
          on customers.customer_id = orders.customer_id
        group by 1
    )
    
    select * from customer_orders
    

    Conclusion: Structure is Freedom

    By adopting a layered approach to your dbt projects in Snowflake, you move from a chaotic, hard-to-maintain process to a scalable, modular, and efficient analytics factory. This structure gives you:

    • Maintainability: When logic needs to change, you know exactly which model to edit.
    • Scalability: Onboarding new data sources or team members becomes a clear, repeatable process.
    • Trust: With testing at every layer, you build confidence in your data and empower the entire organization to make better, faster decisions.

    This framework isn’t just about writing cleaner code—it’s about building a foundation for a mature and reliable data culture.

    Snowflake Architecture Explained: A Simple Breakdown

    In the world of data, Snowflake’s rapid rise to a leader in the cloud data space is a well-known story. But what’s the secret behind its success? The answer isn’t just a list of features; it’s the Snowflake architecture itself. This unique three-layer design makes it fundamentally different from traditional data warehouses and is the key to its performance and scalability. In this post, we’ll go beyond the marketing buzz and deconstruct these core layers, because they are the secret sauce that makes everything else possible, from near-infinite scaling to zero-copy cloning.

    The Flaws of Traditional Data Warehouse Architecture

    Before diving into Snowflake, let’s first remember the pain points of traditional on-premise data warehouses. Historically, engineers built these systems on two types of architectures:

    1. Shared-Disk: In this model, multiple compute nodes (CPUs) all access the same central storage disk, which leads to a bottleneck at the disk level.
    2. Shared-Nothing: Here, each compute node has its own dedicated storage. To work on a large dataset, the system must shuffle data across the network between nodes, creating significant network congestion.

    In both cases, you faced a fundamental problem: contention. Data loading jobs slowed down analytics, complex queries stalled the system for everyone, and scaling became an expensive, all-or-nothing nightmare.

    Snowflake’s Tri-Factor Architecture: A Masterclass in Decoupling

    Fortunately, Snowflake’s founders saw this core problem and solved it with a unique, patented, multi-cluster, shared-data architecture they built specifically for the cloud. You can best understand this architecture as three distinct, independently scalable layers.

    Here’s a visual representation of how these layers interact:

    Diagram of the 3-layer Snowflake architecture, showing the decoupled storage, multi-cluster compute, and cloud services layers.

    Layer 1: The Centralized Storage Foundation

    At its base, Snowflake separates storage from everything else. All your data resides in a single, centralized repository built on cloud object storage such as Amazon S3, Azure Blob Storage, or Google Cloud Storage.

    • Columnar format: Data is stored in compressed, columnar micro-partitions (50–500MB).
    • Immutable micro-partitions: Each partition includes metadata (e.g., min/max values) to optimize query pruning.
    • Self-optimizing: Snowflake automatically handles compression, micro-partition organization, and statistics; no manual tuning or indexing is required.

    Key Benefit: Users don’t manage storage directly; Snowflake handles organization, compression, and optimization automatically.

    Layer 2: The Decoupled Compute Architecture

    This is where the real magic of the Snowflake architecture shines. The compute layer consists of independent clusters of compute resources called Virtual Warehouses. Because each workload (ETL, BI, data science) can run on its own dedicated warehouse, resource contention is eliminated entirely.

    • Concurrency & Isolation: Multiple warehouses can access the same data without contention.
    • Auto-scaling: Warehouses can scale up/down based on workload.
    • Workload separation: You can assign different warehouses to different teams or tasks (e.g., ETL vs. BI).

    Key Benefit: Compute resources are decoupled from storage, allowing flexible scaling and workload isolation.
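
    For example, here’s a hedged sketch, using the Snowflake Python connector with placeholder names and credentials, of creating separate warehouses for ETL and BI so the two workloads never compete for resources:

    Python

    import snowflake.connector

    conn = snowflake.connector.connect(
        account="your_account_identifier",   # placeholder credentials
        user="your_user",
        password="your_password",
        role="SYSADMIN",
    )
    cur = conn.cursor()

    # A larger warehouse for heavy ETL jobs, auto-suspended when idle to control cost
    cur.execute(
        "CREATE WAREHOUSE IF NOT EXISTS etl_wh "
        "WITH WAREHOUSE_SIZE = 'LARGE' AUTO_SUSPEND = 60 AUTO_RESUME = TRUE"
    )

    # A smaller warehouse dedicated to BI dashboards: same data, zero contention
    cur.execute(
        "CREATE WAREHOUSE IF NOT EXISTS bi_wh "
        "WITH WAREHOUSE_SIZE = 'SMALL' AUTO_SUSPEND = 60 AUTO_RESUME = TRUE"
    )

    cur.close()
    conn.close()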

    Layer 3: The Cloud Services Layer as the Architecture’s Brain

    Finally, the services layer acts as the central nervous system of Snowflake, orchestrating everything. It handles query optimization, security, metadata management, and transaction consistency, and it enables powerful features like Zero-Copy Cloning, Time Travel, and Secure Data Sharing.

    • Authentication & access control: Role-based access, encryption, and security policies.
    • Query optimization: Parses, plans, and optimizes SQL queries.
    • Infrastructure management: Handles provisioning, monitoring, and failover.

    Key Benefit: This layer orchestrates the entire platform, ensuring seamless user experience and system reliability.
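
    To make one of those features concrete, here’s a hedged sketch of a Time Travel query using the Python connector; the table, database, and warehouse names are placeholders, and the same SQL runs unchanged in a Snowflake worksheet:

    Python

    import snowflake.connector

    conn = snowflake.connector.connect(
        account="your_account_identifier",   # placeholder credentials
        user="your_user",
        password="your_password",
        warehouse="bi_wh",
        database="analytics",
        schema="public",
    )
    cur = conn.cursor()

    # Time Travel: query the orders table as it looked one hour ago (offset in seconds)
    cur.execute("SELECT COUNT(*) FROM orders AT (OFFSET => -3600)")
    print(cur.fetchone())

    cur.close()
    conn.close()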

    Conclusion: Why the Snowflake Architecture is a Game-Changer

    In conclusion, Snowflake’s success is not an accident; it’s the direct result of a revolutionary architecture that elegantly solves the core challenges that plagued data analytics for decades. By decoupling storage, compute, and services, the Snowflake architecture delivers unparalleled:

    • Performance: Queries run fast without interruption.
    • Concurrency: All users and processes can work simultaneously.
    • Simplicity: The platform manages the complexity for you.
    • Cost-Effectiveness: You only pay for what you use.

    Ultimately, it’s not just an evolution; it’s a redefinition of what a data platform can be.