Building a Serverless Data Pipeline on AWS: A Step-by-Step Guide

For data engineers, the dream is to build pipelines that are robust, scalable, and cost-effective. For years, this meant managing complex clusters and servers. But with the power of the cloud, a new paradigm has emerged: the serverless data pipeline on AWS. This approach allows you to process massive amounts of data without managing a single server, paying only for the compute you actually consume.

Going serverless means you can say goodbye to idle clusters, patching servers, and capacity planning. Instead, you use a suite of powerful AWS services that automatically scale to meet demand. This isn’t just a technical shift; it’s a strategic advantage that allows your team to focus on delivering value from data, not managing infrastructure.

In this guide, we’ll walk you through the essential components and steps to build a modern, event-driven serverless data pipeline on AWS using S3, Lambda, AWS Glue, and Athena.

The Architecture: A Four-Part Harmony

A successful serverless pipeline relies on a few core AWS services working together seamlessly. Each service has a specific role, creating an efficient and automated workflow from raw data ingestion to analytics-ready insights.

Here’s a high-level look at our architecture:

  1. Amazon S3 (Simple Storage Service): The foundation of our pipeline. S3 acts as a highly durable and scalable data lake where we will store our raw, processed, and curated data in different stages.
  2. AWS Lambda: The trigger and orchestrator. Lambda functions are small, serverless pieces of code that can run in response to events, such as a new file being uploaded to S3.
  3. AWS Glue: The serverless ETL engine. Glue can automatically discover the schema of our data and run powerful Spark jobs to clean, transform, and enrich it, converting it into an optimized format like Parquet.
  4. Amazon Athena: The interactive query service. Athena allows us to run standard SQL queries directly on our processed data stored in S3, making it instantly available for analysis without needing a traditional data warehouse.

Now, let’s build it step-by-step.

Step 1: Setting Up the S3 Data Lake Buckets

First, we need a place to store our data. A best practice is to use separate prefixes or even separate buckets to represent the different stages of your data pipeline, creating a clear and organized data lake.

For this guide, we’ll use a single bucket with three prefixes (a short setup sketch follows the list):

  • s3://your-data-lake-bucket/raw/: This is where raw, unaltered data lands from your sources.
  • s3://your-data-lake-bucket/processed/: After cleaning and transformation by our Glue job, the data is stored here in an optimized format (e.g., Parquet).
  • s3://your-data-lake-bucket/curated/: (Optional) A final layer for business-level aggregations or specific data marts.
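S3 prefixes are just key name prefixes, so there is nothing to create beyond the bucket itself. If you’d like to script this setup, here is a minimal boto3 sketch; the bucket name is a placeholder and the create call assumes the us-east-1 region:

Python

import boto3

s3 = boto3.client("s3")
bucket = "your-data-lake-bucket"  # bucket names must be globally unique

# Create the bucket. Outside us-east-1, add a CreateBucketConfiguration
# with the appropriate LocationConstraint.
s3.create_bucket(Bucket=bucket)

# Optional: write zero-byte "folder" markers so the layout shows up in the console.
for prefix in ("raw/", "processed/", "curated/"):
    s3.put_object(Bucket=bucket, Key=prefix)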

Step 2: Creating the Lambda Trigger

Next, we need a mechanism to automatically start our pipeline when new data arrives. AWS Lambda is perfect for this. We will create a Lambda function that “listens” for a file upload event in our raw/ S3 prefix and then starts our AWS Glue ETL job.

Here is sample Python code for the Lambda function:

lambda_function.py

Python

import boto3
import os
from urllib.parse import unquote_plus

def lambda_handler(event, context):
    """
    This Lambda function is triggered by an S3 event and starts an AWS Glue ETL job.
    """
    # Get the Glue job name from environment variables
    glue_job_name = os.environ['GLUE_JOB_NAME']
    
    # Extract the bucket and key from the S3 event.
    # Object keys in S3 event notifications are URL-encoded, so decode them.
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = unquote_plus(event['Records'][0]['s3']['object']['key'])
    
    print(f"File uploaded: s3://{bucket}/{key}")
    
    # Initialize the Glue client
    glue_client = boto3.client('glue')
    
    try:
        print(f"Starting Glue job: {glue_job_name}")
        response = glue_client.start_job_run(
            JobName=glue_job_name,
            Arguments={
                '--S3_SOURCE_PATH': f"s3://{bucket}/{key}"
            }
        )
        print(f"Successfully started Glue job run. Run ID: {response['JobRunId']}")
        return {
            'statusCode': 200,
            'body': f"Started Glue job {glue_job_name} for file s3://{bucket}/{key}"
        }
    except Exception as e:
        print(f"Error starting Glue job: {e}")
        raise

To make this work, you need to:

  1. Create this Lambda function in the AWS console.
  2. Set an environment variable named GLUE_JOB_NAME with the name of the Glue job you’ll create in the next step.
  3. Configure an S3 trigger on the function, pointing it to your s3://your-data-lake-bucket/raw/ prefix for “All object create events.”
  4. Give the function’s execution role permission to start the Glue job (the glue:StartJobRun action), in addition to the basic Lambda logging permissions.
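Before wiring up the real trigger, you can sanity-check the function with a hand-written S3-style test event, either pasted into the Lambda console’s test tab or passed to the handler locally. The bucket and key below are placeholders:

Python

# A minimal S3 "ObjectCreated" test event; bucket and key are placeholders.
test_event = {
    "Records": [
        {
            "s3": {
                "bucket": {"name": "your-data-lake-bucket"},
                "object": {"key": "raw/orders/2025-01-15.csv"}
            }
        }
    ]
}

# Running this locally requires AWS credentials and the GLUE_JOB_NAME environment variable:
# from lambda_function import lambda_handler
# print(lambda_handler(test_event, None))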

Step 3: Transforming Data with AWS Glue

AWS Glue is the heavy lifter in our pipeline. It’s a fully managed ETL service that makes it easy to prepare and load your data for analytics. For this step, create a Glue ETL job.

In AWS Glue Studio, you can build the job visually or write a PySpark script. The job will:

  1. Read the raw data (e.g., CSV) from the source path passed by the Lambda function.
  2. Perform transformations, such as changing data types, dropping columns, or joining with other datasets.
  3. Write the transformed data to the processed/ S3 prefix in Apache Parquet format. Parquet is a columnar storage format that is highly optimized for analytical queries.

Your Glue job will have a simple script that looks something like this:

Python

import sys
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

# Get job arguments
args = getResolvedOptions(sys.argv, ['JOB_NAME', 'S3_SOURCE_PATH'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Read the raw CSV data from S3
source_dyf = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": [args['S3_SOURCE_PATH']]},
    format="csv",
    format_options={"withHeader": True},
)

# Convert to Parquet and write to the processed location
glueContext.write_dynamic_frame.from_options(
    frame=source_dyf,
    connection_type="s3",
    connection_options={"path": "s3://your-data-lake-bucket/processed/"},
    format="parquet",
)

job.commit()
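The script above does nothing more than convert the raw CSV to Parquet. If you also want the type changes or column drops mentioned earlier, one option is Glue’s ApplyMapping transform, slotted in between the read and the write. The column names and types below are assumptions about your data, not part of the pipeline above:

Python

from awsglue.transforms import ApplyMapping

# Map (source column, source type) to (target column, target type);
# columns not listed in the mappings are dropped.
mapped_dyf = ApplyMapping.apply(
    frame=source_dyf,
    mappings=[
        ("order_id", "string", "order_id", "string"),
        ("customer_id", "string", "customer_id", "string"),
        ("order_status", "string", "order_status", "string"),
        ("order_total", "string", "order_total", "double"),
        ("order_date", "string", "order_date", "date"),
    ],
)

# Pass mapped_dyf (instead of source_dyf) to write_dynamic_frame.from_options.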

Step 4: Querying Processed Data with Amazon Athena

Once your data is processed and stored as Parquet in S3, it’s ready for analysis. With Amazon Athena, you don’t need to load it into another database. You can query it right where it is.

  1. Create a Database: In the Athena query editor, create a database for your data lake: CREATE DATABASE my_data_lake;
  2. Run a Glue Crawler (or Create a Table): The easiest way to make your data queryable is to run an AWS Glue Crawler on your processed/ S3 prefix. The crawler will automatically detect the schema of your Parquet files and create an Athena table for you (a boto3 sketch for this follows the list).
  3. Query Your Data: Once the table is created, you can run standard SQL queries on it.
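If you prefer to script step 2 rather than click through the console, here is a minimal boto3 sketch. The crawler name, IAM role, and database are assumptions; the role needs read access to the processed/ prefix:

Python

import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="processed-data-crawler",       # assumed name
    Role="AWSGlueServiceRole-DataLake",  # assumed IAM role with S3 read access
    DatabaseName="my_data_lake",
    Targets={"S3Targets": [{"Path": "s3://your-data-lake-bucket/processed/"}]},
)
glue.start_crawler(Name="processed-data-crawler")

Once the crawler has run and the table exists (here we assume it’s named processed_data), a query against the processed data looks like this: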

SQL

SELECT
    customer_id,
    order_status,
    COUNT(order_id) as number_of_orders
FROM
    my_data_lake.processed_data
WHERE
    order_date >= '2025-01-01'
GROUP BY
    1, 2
ORDER BY
    3 DESC;
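Athena also exposes an API, so the same kind of query can be run programmatically, for example from a reporting job. Here is a minimal boto3 sketch, assuming a results prefix in the same bucket as the query output location:

Python

import time
import boto3

athena = boto3.client("athena")

# Start the query; Athena writes result files to the output location.
response = athena.start_query_execution(
    QueryString="SELECT customer_id, COUNT(order_id) AS number_of_orders "
                "FROM my_data_lake.processed_data GROUP BY 1",
    QueryExecutionContext={"Database": "my_data_lake"},
    ResultConfiguration={"OutputLocation": "s3://your-data-lake-bucket/athena-results/"},
)
query_id = response["QueryExecutionId"]

# Poll until the query finishes, then fetch the first page of results.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
print(f"Query {state}, returned {len(rows)} rows (including the header row)")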

Conclusion: The Power of Serverless

You have now built a fully automated, event-driven, and serverless data pipeline on AWS. When a new file lands in your raw S3 bucket, a Lambda function triggers a Glue job that processes the data and writes it back to S3 in an optimized format, ready to be queried instantly by Athena.

This architecture is not only powerful but also incredibly efficient. It scales automatically to handle terabytes of data and ensures you only pay for the resources you use, making it the perfect foundation for a modern data engineering stack.
