Data Modeling for the Modern Data Warehouse: A Guide

In the world of data engineering, it’s easy to get excited about the latest tools and technologies. But before you can build powerful pipelines and insightful dashboards, you need a solid foundation. That foundation is data modeling. Without a well-designed data model, even the most advanced data warehouse can become a slow, confusing, and unreliable “data swamp.”

Data modeling is the process of structuring your data to be stored in a database. For a modern data warehouse, the goal is not just to store data, but to store it in a way that is optimized for fast and intuitive analytical queries.

This guide will walk you through the most important concepts of data modeling for the modern data warehouse, focusing on the time-tested star schema and the crucial concept of Slowly Changing Dimensions (SCDs).

The Foundation: Kimball’s Star Schema

While there are several data modeling methodologies, the star schema, popularized by Ralph Kimball, remains the gold standard for analytical data warehouses. Its structure is simple, effective, and easy for both computers and humans to understand.

A star schema is composed of two types of tables:

Fact Tables: These tables store the “facts” or quantitative measurements about a business process. Think of sales transactions, website clicks, or sensor readings. Fact tables are typically very long and narrow.
Dimension Tables: These tables store the descriptive “who, what, where, when, why” context for the facts. Think of customers, products, locations, and dates. Dimension tables are typically much smaller and wider than fact tables.

Why the Star Schema Works:

Performance: The simple, predictable structure allows for fast joins and aggregations.
Simplicity: It’s intuitive for analysts and business users to understand, making it easier to write queries and build reports.

Example: A Sales Data Model

Fact Table (fct_sales):
- order_id
- customer_key (foreign key)
- product_key (foreign key)
- date_key (foreign key)
- sale_amount
- quantity_sold
Dimension Table (dim_customer):
- customer_key (primary key)
- customer_name
- city
- country
Dimension Table (dim_product):
- product_key (primary key)
- product_name
- category
- brand

Handling Change: Slowly Changing Dimensions (SCDs)

Business is not static. A customer moves to a new city, a product is rebranded, or a sales territory is reassigned. How do you handle these changes in your dimension tables without losing historical accuracy? This is where Slowly Changing Dimensions (SCDs) come in.

There are several types of SCDs, but two are essential for every data engineer to know.

SCD Type 1: Overwrite the Old Value

This is the simplest approach. When a value changes, you simply overwrite the old value with the new one.

When to use it: When you don’t need to track historical changes. For example, correcting a spelling mistake in a customer’s name.
Drawback: You lose all historical context.

SCD Type 2: Add a New Row

This is the most common and powerful type of SCD. Instead of overwriting, you add a new row for the customer with the updated information. The old row is kept but marked as “inactive.” This is typically managed with a few extra columns in your dimension table.

Example dim_customer Table with SCD Type 2:

customer_key	customer_id	customer_name	city	is_active	effective_date	end_date
101	CUST-A	Jane Doe	New York	false	2023-01-15	2024-08-30
102	CUST-A	Jane Doe	London	true	2024-09-01	9999-12-31

When Jane Doe moved from New York to London, we added a new row (key 102).
The old row (key 101) was marked as inactive.
This allows you to accurately analyze historical sales. Sales made before September 1, 2024, will correctly join to the “New York” record, while sales after that date will join to the “London” record.

Conclusion: Build a Solid Foundation

Data modeling is not just a theoretical exercise; it is a practical necessity for building a successful data warehouse. By using a clear and consistent methodology like the star schema and understanding how to handle changes with Slowly Changing Dimensions, you can create a data platform that is not only high-performing but also a reliable and trusted source of truth for your entire organization. Before you write a single line of ETL code, always start with a solid data model.

Data Modeling for the Modern Data Warehouse: A Guide

The Foundation: Kimball’s Star Schema

Handling Change: Slowly Changing Dimensions (SCDs)

SCD Type 1: Overwrite the Old Value

SCD Type 2: Add a New Row

Conclusion: Build a Solid Foundation

Comments

Leave a Reply Cancel reply

More posts

Snowflake Openflow Tutorial Guide 2025

A Data Engineer’s Handbook to Snowflake Performance and SQL Improvements 2025

Snowflake Native dbt Integration: Complete 2025 Guide

Your First Salesforce Copilot Action : A 5-Step Guide