Data analytics has come a long way. Hardware, software, storage, user interfaces: all of it has gotten cheaper and better over the past few decades. Moore's law in full effect! Yet even with progress in all of these areas, nothing takes the place of a good, efficient data model. BI vendors tout how easy it is to quickly upload data, load files into cloud storage, and run instant analytics on them. But what about Change Data Capture (CDC), efficient storage, and scaling to terabytes or petabytes of data? Analyzing raw CSV or flat files will never play out well in the long term. They are fine for a quick Proof of Concept or Proof of Value, to confirm that the data you have can answer the KPIs you want to analyze, but they do not scale well and handle multiple source systems poorly.
The Kimball design methodology and Inmon modeling techniques are still extremely valid; they allow for scale, performance, and a minimal footprint for your data warehouse. It would behoove anyone to grab a copy of Agile Data Warehouse Design and learn these techniques from the ground up. Let's cover the basics, though. We want to create a fact/dimension model. The fact table holds only measures (things that can be added, summed, or averaged), boolean flags, and integer keys pointing to dimensions. Any text fields that can repeat, or anything that adds context to your measures, should be broken out into dimension tables.
Let's think about the following CSV, which might come from a transactional Point of Sale system.
| Transaction ID | Date | Product | Customer | Amount | Refund |
|---|---|---|---|---|---|
| 123456 | 2023-09-06 | Apples | John Doe | 1.25 | Y |
| 123457 | 2023-09-07 | Oranges | Jane Doe | 1.30 | N |
What items here should be put on the fact table? What items should get their own dimension?
Following best practices, you would create the following dimensions, each with an integer surrogate key.
Dim_Date
| Date_key | Date | DayOfWeek | Quarter | Year | Period |
|---|---|---|---|---|---|
| 20230906 | 2023-09-06 | Wednesday | 3 | 2023 | 36 |
| 20230907 | 2023-09-07 | Thursday | 3 | 2023 | 36 |
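Every attribute in the date dimension above can be derived from the calendar date itself, which is why date dimensions are usually generated rather than hand-entered. A minimal Python sketch (the column names are illustrative, and "Period" is assumed here to mean the ISO week number, which matches the values above):

```python
from datetime import date

def date_dim_row(d: date) -> dict:
    """Derive date-dimension attributes from a single calendar date."""
    return {
        "date_key": int(d.strftime("%Y%m%d")),  # smart integer key, e.g. 20230906
        "date": d.isoformat(),
        "day_of_week": d.strftime("%A"),
        "quarter": (d.month - 1) // 3 + 1,
        "year": d.year,
        "period": d.isocalendar()[1],           # ISO week number (assumption)
    }

row = date_dim_row(date(2023, 9, 6))
```

Generating one such row per day for a span of years, in advance, is the common way to populate Dim_Date.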
Dim_Product
| product_key | description | SKU | Color |
|---|---|---|---|
| 1 | Apples | 12345 | Red |
| 2 | Oranges | 54321 | Orange |
Dim_Customer
| customer_key | First Name | Last Name | Full Name | City | State |
|---|---|---|---|---|---|
| 1 | John | Doe | John Doe | New York | NY |
| 2 | Jane | Doe | Jane Doe | Denver | CO |
With each of these dimensions configured, you would then have a fact table composed of:
Fact_Sales
| Transaction ID | date_key | product_key | customer_key | Amount | Refund |
|---|---|---|---|---|---|
| 123456 | 20230906 | 1 | 1 | 1.25 | Y |
| 123457 | 20230907 | 2 | 2 | 1.30 | N |
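The whole star schema can be seen in action with an in-memory SQLite sketch. The table and column names follow the example above but are lowercased for convenience; this is an illustration, not production DDL:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Dimension tables hold the descriptive context; the fact table holds
# measures, flags, and integer keys into the dimensions.
cur.executescript("""
CREATE TABLE dim_product  (product_key  INTEGER PRIMARY KEY, description TEXT, sku TEXT, color TEXT);
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, full_name TEXT, city TEXT, state TEXT);
CREATE TABLE fact_sales (
    transaction_id INTEGER PRIMARY KEY,
    date_key       INTEGER,
    product_key    INTEGER REFERENCES dim_product(product_key),
    customer_key   INTEGER REFERENCES dim_customer(customer_key),
    amount         REAL,
    refund         TEXT
);
""")

cur.executemany("INSERT INTO dim_product VALUES (?,?,?,?)",
                [(1, "Apples", "12345", "Red"), (2, "Oranges", "54321", "Orange")])
cur.executemany("INSERT INTO dim_customer VALUES (?,?,?,?)",
                [(1, "John Doe", "New York", "NY"), (2, "Jane Doe", "Denver", "CO")])
cur.executemany("INSERT INTO fact_sales VALUES (?,?,?,?,?,?)",
                [(123456, 20230906, 1, 1, 1.25, "Y"),
                 (123457, 20230907, 2, 2, 1.30, "N")])

# Analysis queries join the fact to its dimensions on integer keys only.
rows = cur.execute("""
    SELECT p.description, c.full_name, f.amount
    FROM fact_sales f
    JOIN dim_product  p ON p.product_key  = f.product_key
    JOIN dim_customer c ON c.customer_key = f.customer_key
    ORDER BY f.transaction_id
""").fetchall()
```

The join recovers the original CSV rows, but the fact table itself never stores a product name or customer name twice.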
This makes your fact table well optimized for SQL: joins use only integers, the dimensions carry all the extra context for analysis, and the fact table stores and scales easily to millions or billions of rows. If you're using a traditional RDBMS such as Postgres, MySQL, MariaDB, Oracle, or DB2, don't forget to index your foreign key columns! Also keep in mind that the major cloud providers offer serverless flavors of Postgres, MySQL, Redshift, and others, so you pay only while you are actually querying and updating data. That saves on instance time and storage costs while providing scalable performance over the long term.
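On the index point: many engines (Postgres included) index primary keys automatically but do not index foreign key columns, so the fact table's key columns usually need explicit indexes. A small SQLite sketch of the DDL, with hypothetical index names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE fact_sales (transaction_id INTEGER PRIMARY KEY, "
            "date_key INTEGER, product_key INTEGER, customer_key INTEGER, "
            "amount REAL, refund TEXT)")

# One index per foreign-key column keeps dimension joins fast at scale.
for col in ("date_key", "product_key", "customer_key"):
    cur.execute(f"CREATE INDEX idx_fact_sales_{col} ON fact_sales ({col})")

# Confirm the indexes exist in the catalog.
indexes = [r[0] for r in cur.execute(
    "SELECT name FROM sqlite_master WHERE type = 'index' ORDER BY name")]
```

The same `CREATE INDEX` statements (with engine-appropriate syntax) apply in Postgres, MySQL, and friends.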
As always, we would love to talk with you about your data, and show you how to leverage it more efficiently and effectively! Let’s talk about your project today!