Data analytics has come a long way. Hardware, software, storage, user interfaces: all of it has gotten cheaper and better over the past few decades. Moore's law in full effect! Yet even with progress in all of these areas, nothing takes the place of a good, efficient data model. BI vendors tout how easy it is to quickly upload data, load files into cloud storage, and run instant analytics on them. But what about Change Data Capture (CDC), efficient storage, and scaling to terabytes or petabytes of data? Analyzing raw CSV or flat files will never play out well in the long term. They are fine for a quick Proof of Concept or Proof of Value, to confirm that the data you have can answer the KPIs you want to analyze, but they do not scale well and handle multiple source systems poorly.
The Kimball design methodology and Inmon modeling techniques are still extremely valid; they allow for scale, performance, and a minimal footprint for your data warehouse. It would behoove anyone to grab a copy of Agile Data Warehouse Design and learn these techniques from the ground up. Let's cover the basics, though. We want to create a fact/dimension model. The fact table holds only measures (things that can be added, summed, or averaged), boolean flags, and integer keys pointing to dimensions. Any text fields that can repeat, or anything that adds context to your measures, should be broken out into dimension tables.
Let's think about the following CSV, which might come from a transactional Point of Sale system.
| Transaction ID | Date | Product | Customer | Amount | Refund |
|---|---|---|---|---|---|
| 123456 | 2023-09-06 | Apples | John Doe | 1.25 | Y |
| 123457 | 2023-09-07 | Oranges | Jane Doe | 1.30 | N |
What items here should be put on the fact table? What items should get their own dimension?
Following best practices, you would create the following dimensions, each with an integer surrogate key.
Dim_Date
| Date_key | Date | DayOfWeek | Quarter | Year | Period |
|---|---|---|---|---|---|
| 20230906 | 2023-09-06 | Wednesday | 3 | 2023 | 36 |
| 20230907 | 2023-09-07 | Thursday | 3 | 2023 | 36 |
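Every attribute in the date dimension above can be derived from the calendar date itself, which is why date dimensions are usually generated rather than hand-entered. A minimal Python sketch (the column names are illustrative, and "Period" is assumed here to mean the ISO week number, which matches the values above):

```python
from datetime import date

def date_dim_row(d: date) -> dict:
    """Derive date-dimension attributes from a single calendar date."""
    return {
        "date_key": int(d.strftime("%Y%m%d")),  # smart integer key, e.g. 20230906
        "date": d.isoformat(),
        "day_of_week": d.strftime("%A"),
        "quarter": (d.month - 1) // 3 + 1,
        "year": d.year,
        "period": d.isocalendar()[1],           # ISO week number (assumption)
    }

row = date_dim_row(date(2023, 9, 6))
```

Generating one such row per day for a span of years, in advance, is the common way to populate Dim_Date.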
Dim_Product
| product_key | description | SKU | Color |
|---|---|---|---|
| 1 | Apples | 12345 | Red |
| 2 | Oranges | 54321 | Orange |
Dim_Customer
| customer_key | First Name | Last Name | Full Name | City | State |
|---|---|---|---|---|---|
| 1 | John | Doe | John Doe | New York | NY |
| 2 | Jane | Doe | Jane Doe | Denver | CO |
With each of these dimensions configured, you would then have a fact table composed of:
Fact_Sales
| Transaction ID | date_key | product_key | customer_key | Amount | Refund |
|---|---|---|---|---|---|
| 123456 | 20230906 | 1 | 1 | 1.25 | Y |
| 123457 | 20230907 | 2 | 2 | 1.30 | N |
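The whole star schema can be seen in action with an in-memory SQLite sketch. The table and column names follow the example above but are lowercased for convenience; this is an illustration, not production DDL:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Dimension tables hold the descriptive context; the fact table holds
# measures, flags, and integer keys into the dimensions.
cur.executescript("""
CREATE TABLE dim_product  (product_key  INTEGER PRIMARY KEY, description TEXT, sku TEXT, color TEXT);
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, full_name TEXT, city TEXT, state TEXT);
CREATE TABLE fact_sales (
    transaction_id INTEGER PRIMARY KEY,
    date_key       INTEGER,
    product_key    INTEGER REFERENCES dim_product(product_key),
    customer_key   INTEGER REFERENCES dim_customer(customer_key),
    amount         REAL,
    refund         TEXT
);
""")

cur.executemany("INSERT INTO dim_product VALUES (?,?,?,?)",
                [(1, "Apples", "12345", "Red"), (2, "Oranges", "54321", "Orange")])
cur.executemany("INSERT INTO dim_customer VALUES (?,?,?,?)",
                [(1, "John Doe", "New York", "NY"), (2, "Jane Doe", "Denver", "CO")])
cur.executemany("INSERT INTO fact_sales VALUES (?,?,?,?,?,?)",
                [(123456, 20230906, 1, 1, 1.25, "Y"),
                 (123457, 20230907, 2, 2, 1.30, "N")])

# Analysis queries join the fact to its dimensions on integer keys only.
rows = cur.execute("""
    SELECT p.description, c.full_name, f.amount
    FROM fact_sales f
    JOIN dim_product  p ON p.product_key  = f.product_key
    JOIN dim_customer c ON c.customer_key = f.customer_key
    ORDER BY f.transaction_id
""").fetchall()
```

The join recovers the original CSV rows, but the fact table itself never stores a product name or customer name twice.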
This makes your fact table well optimized for SQL: joins use only integers, the dimensions carry all the extra context for analysis, and the fact table stores and scales easily to millions or billions of rows. If you're using a traditional RDBMS such as Postgres, MySQL, MariaDB, Oracle, or DB2, don't forget to index your foreign key columns! Also keep in mind that the major cloud providers offer serverless flavors of Postgres, MySQL, Redshift, and others, so you pay only while you are actually querying and updating data. That saves on instance time and storage costs while providing scalable performance over the long term.
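On the index point: many engines (Postgres included) index primary keys automatically but do not index foreign key columns, so the fact table's key columns usually need explicit indexes. A small SQLite sketch of the DDL, with hypothetical index names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE fact_sales (transaction_id INTEGER PRIMARY KEY, "
            "date_key INTEGER, product_key INTEGER, customer_key INTEGER, "
            "amount REAL, refund TEXT)")

# One index per foreign-key column keeps dimension joins fast at scale.
for col in ("date_key", "product_key", "customer_key"):
    cur.execute(f"CREATE INDEX idx_fact_sales_{col} ON fact_sales ({col})")

# Confirm the indexes exist in the catalog.
indexes = [r[0] for r in cur.execute(
    "SELECT name FROM sqlite_master WHERE type = 'index' ORDER BY name")]
```

The same `CREATE INDEX` statements (with engine-appropriate syntax) apply in Postgres, MySQL, and friends.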
As always, we would love to talk with you about your data, and show you how to leverage it more efficiently and effectively! Let’s talk about your project today!