In the world of data engineering, it’s easy to get excited about the latest tools and technologies. But before you can build powerful pipelines and insightful dashboards, you need a solid foundation. That foundation is data modeling. Without a well-designed data model, even the most advanced data warehouse can become a slow, confusing, and unreliable “data swamp.”
Data modeling is the process of structuring your data to be stored in a database. For a modern data warehouse, the goal is not just to store data, but to store it in a way that is optimized for fast and intuitive analytical queries.
This guide will walk you through the most important concepts of data modeling for the modern data warehouse, focusing on the time-tested star schema and the crucial concept of Slowly Changing Dimensions (SCDs).

The Foundation: Kimball’s Star Schema
While there are several data modeling methodologies, the star schema, popularized by Ralph Kimball, remains the gold standard for analytical data warehouses. Its structure is simple, effective, and easy for both computers and humans to understand.
A star schema is composed of two types of tables:
- Fact Tables: These tables store the “facts” or quantitative measurements about a business process. Think of sales transactions, website clicks, or sensor readings. Fact tables are typically very long and narrow.
- Dimension Tables: These tables store the descriptive “who, what, where, when, why” context for the facts. Think of customers, products, locations, and dates. Dimension tables are typically much smaller and wider than fact tables.
Why the Star Schema Works:
- Performance: The simple, predictable structure allows for fast joins and aggregations.
- Simplicity: It’s intuitive for analysts and business users to understand, making it easier to write queries and build reports.
Example: A Sales Data Model
- Fact Table (
fct_sales):order_idcustomer_key(foreign key)product_key(foreign key)date_key(foreign key)sale_amountquantity_sold
- Dimension Table (
dim_customer):customer_key(primary key)customer_namecitycountry
- Dimension Table (
dim_product):product_key(primary key)product_namecategorybrand
Handling Change: Slowly Changing Dimensions (SCDs)
Business is not static. A customer moves to a new city, a product is rebranded, or a sales territory is reassigned. How do you handle these changes in your dimension tables without losing historical accuracy? This is where Slowly Changing Dimensions (SCDs) come in.
There are several types of SCDs, but two are essential for every data engineer to know.
SCD Type 1: Overwrite the Old Value
This is the simplest approach. When a value changes, you simply overwrite the old value with the new one.
- When to use it: When you don’t need to track historical changes. For example, correcting a spelling mistake in a customer’s name.
- Drawback: You lose all historical context.
SCD Type 2: Add a New Row
This is the most common and powerful type of SCD. Instead of overwriting, you add a new row for the customer with the updated information. The old row is kept but marked as “inactive.” This is typically managed with a few extra columns in your dimension table.
Example dim_customer Table with SCD Type 2:
| customer_key | customer_id | customer_name | city | is_active | effective_date | end_date |
| 101 | CUST-A | Jane Doe | New York | false | 2023-01-15 | 2024-08-30 |
| 102 | CUST-A | Jane Doe | London | true | 2024-09-01 | 9999-12-31 |
- When Jane Doe moved from New York to London, we added a new row (key 102).
- The old row (key 101) was marked as inactive.
- This allows you to accurately analyze historical sales. Sales made before September 1, 2024, will correctly join to the “New York” record, while sales after that date will join to the “London” record.
Conclusion: Build a Solid Foundation
Data modeling is not just a theoretical exercise; it is a practical necessity for building a successful data warehouse. By using a clear and consistent methodology like the star schema and understanding how to handle changes with Slowly Changing Dimensions, you can create a data platform that is not only high-performing but also a reliable and trusted source of truth for your entire organization. Before you write a single line of ETL code, always start with a solid data model.

Leave a Reply