top of page

Star and Snowflake Schema’s - The gift that keeps on giving!

Data analytics has come a long way. Hardware, Software, Storage, User Interfaces, etc. They have all gotten cheaper and better over the past few decades. Moore’s law in full effect! However, even with progress in all these areas, it still does not take the place of a good efficient data model! I see all the BI vendors tout how easy it is to quickly upload data, load files into cloud storage, and do instant analytics on them. But….. What about Change Data Capture (CDC), Efficient storage, and scaling to terabytes, or petabytes of data? Just analyzing CSV or flat files will never play out well in the long term. They are only good for a quick Proof of Concept, or Proof of Value. Those files are great just to make sure the data that you do have will answer the KPI’s you are looking to analyze. They are not great for scaling or multiple source systems.


The Kimball Design Methodology or Inmon modeling techniques are still extremely valid methodologies that allow for scale, performance, and minimal size for your data warehouse. It would behoove anyone to grab a copy of Agile Data Warehouse Design and learn these techniques from the ground up. Let's cover the basics though. We want to be able to create a fact/dimension model. The fact table will only hold things that can be added, summed, averaged, (measures), boolean, flags, and keys to dimensions. Any sort of text fields that can be repeated in any way, or things that may add context to your measures should be broken out into dimension tables.


Lets think about the following CSV that would come from a transactional Point of Sale system.


Transaction ID

Date

Product

Customer

Amount

Refund

123456

2023-09-06

Apples

John Doe

1.25

Y

123457

2023-09-07

Oranges

Jane Doe

1.30

N


What items here should be put on the fact table? What items should get their own dimension?


Following best practices, you would create the following dimensions that will have integer keys.


Dim_Date

Date_key

Date

DayOfWeek

Quarter

Year

Period

20230906

2023-09-06

Wednesday

3

2023

36

20230907

2023-09-06

Thursday

3

2023

36


Dim_product


product_key

description

SKU

Color

1

Apples

12345

Red

2

Oranges

54321

Orange


Dim_customer


customer_key

First Name

Last Name

Full Name

City

State

1

John

Doe

John Doe

New York

NY

2

Jane

Doe

Jane Doe

Denver

CO


With each of these dimensions configured, you would then have a fact table comprised of


Fact_sales


Transaction ID

date_key

product_key

customer_key

Amount

Refund

123456

20230906

1

1

1.25

Y

123457

20230907

2

2

1.30

N


This makes your fact table optimized for SQL by only joining with integers, providing all the extra context about your dimensions to do analysis, and easy storage and scaling of your fact table to millions or billions of rows. Also, if you’re using traditional RDBMS systems like PostGres, MySQL, MariaDB, Oracle, DB2, don’t forget your foreign key indexes! Also, keep in mind that there are serverless offerings for PostGres, Mysql, Redshift etc. from all the major cloud providers. That means you only pay when you’re querying data and updating data! This will save you on instance time, storage costs, and will provide long term scalable performance!


As always, we would love to talk with you about your data, and show you how to leverage it more efficiently and effectively! Let’s talk about your project today!




21 views0 comments

Recent Posts

See All

Comments


bottom of page