Lesson 9

Snowflake Schema in SQL: Normalised Data Warehouse Design

Understand how a snowflake schema extends the star schema so you can navigate multi-level dimension joins in analytical databases.

You're building a product sales report and you want to group revenue by category, not by individual product. The products table has a category_id column, but the category name isn't there. It lives in a separate table. That extra step, from product to category, is the snowflake schema in action. Instead of one flat dimension table, the data is split across two.

How the Snowflake Schema Differs from a Star Schema

In a [star schema](sqlround.com/lessons/sql-star-schema), each dimension table sits directly next to the fact table. One join gets you from the fact to the context you need. The snowflake schema takes that further. Some dimension tables don't hold all their descriptive data directly. Instead, they reference another table for part of it.

The result is a chain. Fact table to dimension table to sub-dimension table. Each link is one more JOIN.

Key idea: A snowflake schema is a star schema where at least one dimension table has been split into two or more levels.

In the ecommerce schema, products has a category_id column but no category_name column. That column would live in a separate categories table in a fully normalised setup. The products table is the first-level dimension. The categories table would be the second level. This is exactly the snowflake pattern.

The ecommerce schema doesn't include a categories table by name, but category_id in products is the foreign key that would link to one. In the exercises, you'll work with the data that is available. In the lesson body, we'll use the products and order_items tables to show how multi-level joins work when a dimension table references another layer of context.

Writing Queries Across Snowflake Levels

The practical difference between a star schema and a snowflake schema shows up in your FROM clause. You need more JOINs.

Start with a simple star-style query: revenue per product.

SELECT p.product_name, SUM(oi.line_total) AS revenue
FROM order_items oi
JOIN products p ON oi.product_id = p.product_id
GROUP BY p.product_name

That's one join. One level of dimension. Now imagine you need revenue grouped by brand instead of by product. brand is also in products, so it's still one join. But if brand were stored in a separate brands table, you'd need a second join to reach it.

-- if brand were in a separate table, you'd need this pattern
SELECT p.brand, SUM(oi.line_total) AS revenue
FROM order_items oi
JOIN products p ON oi.product_id = p.product_id
GROUP BY p.brand
ORDER BY revenue DESC

In a snowflake schema, that second dimension becomes its own table. The query structure stays the same: fact table first, then each dimension in order, each joined on its foreign key.

The chain matters. You can't skip a level. If you want data from the third table in a chain, you must JOIN through the second table first.

Normalisation and Why Snowflake Schemas Exist

The reason dimension tables get split is normalisation. If a thousand products all share the same category name, storing that name a thousand times in the products table wastes space and creates inconsistency risk. One update to the category name has to touch a thousand rows.

With a snowflake schema, the category name lives once in a categories table. The products table just stores the category_id. Update the name in one place, and every product reflects it immediately.

Key idea: Snowflake schemas reduce data duplication by storing shared descriptive values once and referencing them by ID.

The trade-off is query complexity. Every extra level of normalisation is another JOIN. For reporting databases where queries run constantly, that can matter. For transactional databases where data integrity matters more, it's often worth it.

This is why many [data warehouse](sqlround.com/lessons/sql-star-schema) environments deliberately de-normalise back toward the star schema. They accept some duplication in exchange for faster, simpler queries.

How Snowflake Schema Connects to Other Concepts

The JOIN syntax in a snowflake schema is identical to any other [JOIN](sqlround.com/lessons/sql-joins). The only difference is how many times you use it in one query.

Understanding normalisation also connects to [ER diagrams](sqlround.com/lessons/sql-er-diagrams). An ER diagram shows you the exact chain of tables you need to traverse in a snowflake schema before you write a single line.

Common Mistakes

1. Skipping an intermediate table in the chain

Trying to join a sub-dimension directly to the fact table fails if there's no direct foreign key between them.

-- wrong: skipping products and jumping straight to a hypothetical categories table
SELECT oi.order_id, oi.line_total
FROM order_items oi
JOIN products p ON oi.product_id = p.category_id

Join through each level in order, using the correct foreign key at each step.

-- correct
SELECT p.product_name, oi.line_total
FROM order_items oi
JOIN products p ON oi.product_id = p.product_id

2. Grouping by an ID column instead of a name column

When the descriptive label lives in a sub-dimension table, grouping by the ID from the dimension table gives you numbers, not readable categories.

-- wrong
SELECT p.category_id, SUM(oi.line_total) AS revenue
FROM order_items oi
JOIN products p ON oi.product_id = p.product_id
GROUP BY p.category_id

Join through to the table that holds the label, and group by the name column instead.

-- correct: using brand as the grouping label since category_name isn't in this schema
SELECT p.brand, SUM(oi.line_total) AS revenue
FROM order_items oi
JOIN products p ON oi.product_id = p.product_id
GROUP BY p.brand
ORDER BY revenue DESC

3. Joining dimension tables to each other instead of through the fact table

In both star and snowflake schemas, dimension tables at the same level don't connect to each other. They connect through the fact table or through a parent dimension.

-- wrong
SELECT p.product_name, c.first_name
FROM products p
JOIN customers c ON p.product_id = c.customer_id

Always start from the fact table and branch outward to each dimension separately.

-- correct
SELECT p.product_name, c.first_name, oi.line_total
FROM order_items oi
JOIN products p ON oi.product_id = p.product_id
JOIN orders o ON oi.order_id = o.order_id
JOIN customers c ON o.customer_id = c.customer_id

Quick Recap

  • A snowflake schema extends the star schema by splitting flat dimension tables into multiple related tables, each joined by a foreign key.
  • Each extra level of normalisation removes duplicated data but adds one more JOIN to your queries.
  • In the ecommerce schema, products referencing category_id is an example of a dimension table pointing to a sub-dimension.
  • You cannot skip levels in a snowflake chain: each table must be joined in order from the fact table outward.
  • Star schemas are faster to query; snowflake schemas are cleaner to maintain. Most real databases sit somewhere between the two.
  • Recognising the snowflake pattern helps you write multi-level JOIN queries with confidence rather than guessing which tables connect.
Exercise 1 Easy
Write a query to return the product_name, brand, and category_id for every product in the products table.
Hint: SELECT the three columns directly from products. No joins needed.
Solution
SELECT p.product_name, p.brand, p.category_id
FROM products p
Exercise 2 Easy
Write a query to return the product_name and category_id for every product, sorted by category_id ascending so products in the same category appear together.
Hint: ORDER BY category_id ASC after selecting from products.
Solution
SELECT p.product_name, p.category_id
FROM products p
ORDER BY p.category_id ASC
Exercise 3 Medium
Write a query to return each brand from products and the total line_total sold for that brand across all orders, using the column alias total_revenue.
Hint: JOIN order_items and products on product_id, GROUP BY brand, and SUM line_total.
Solution
SELECT p.brand, SUM(oi.line_total) AS total_revenue
FROM order_items oi
JOIN products p ON oi.product_id = p.product_id
GROUP BY p.brand
Exercise 4 Medium
Write a query to return each brand and its total line_total, but only include brands where the total revenue exceeds 500, sorted by total revenue descending.
Hint: Use HAVING to filter the grouped result. Reference the SUM in both HAVING and ORDER BY.
Solution
SELECT p.brand, SUM(oi.line_total) AS total_revenue
FROM order_items oi
JOIN products p ON oi.product_id = p.product_id
GROUP BY p.brand
HAVING SUM(oi.line_total) > 500
ORDER BY total_revenue DESC
Exercise 5 Medium
Write a query to return the country of each customer and the total line_total from their purchases, grouped by country and sorted from highest to lowest, using the column alias total_spent.
Hint: JOIN order_items to orders on order_id, then JOIN orders to customers on customer_id. GROUP BY country and SUM line_total.
Solution
SELECT c.country, SUM(oi.line_total) AS total_spent
FROM order_items oi
JOIN orders o ON oi.order_id = o.order_id
JOIN customers c ON o.customer_id = c.customer_id
GROUP BY c.country
ORDER BY total_spent DESC
Frequently Asked Questions

A snowflake schema is a database design where dimension tables are broken into sub-tables instead of being flat. Each dimension can have one or more levels of related tables branching off it, which is why it looks like a snowflake.

In a star schema, each dimension table is flat and connects directly to the fact table. In a snowflake schema, dimension tables are split into multiple related tables, adding extra JOIN steps but reducing repeated data.

Use a snowflake schema when data duplication in dimension tables is a concern, or when you need to maintain a single source of truth for shared reference data like categories or regions.

It can, because each additional level of normalisation adds a JOIN. Star schema queries are typically faster to write and run, which is why many reporting tools prefer them.