That is the question.
If you have been following all the BLOG posts at KPI Forge, you’ll notice that we like to touch on a lot of different topics. Most of these topics are all real-world scenarios and solutions that we are either currently working on, have done in the past, or are considering for the future. We currently have a case where we are debating a very complex relational data model, or pursuing a new path. Graph Databases.
What is a graph database?
To steal from AWS. “Graph databases are purpose-built to store and navigate relationships. Relationships are first-class citizens in graph databases, and most of the value of graph databases is derived from these relationships. Graph databases use nodes to store data entities, and edges to store relationships between entities.” If you would like to read the entire article, here it is in all its glory! https://aws.amazon.com/nosql/graph/#:~:text=Graph%20databases%20are%20purpose%2Dbuilt,to%20store%20relationships%20between%20entities.
Why would we pursue this? Well… in our case we were trying to model a dimension for a “person”. In this case our dim_person table could be derived from over a half a dozen different data sources. To get this modeled into a unified data model, we find that we would have some very sparsely populated columns in this dim_person table because all the contextual information we wanted to hold in our dimension, didn’t always exist in each data source. Our great partners at know.bi introduced us to what a graph database could be used for. There are specific solutions available for a graph database including neo4J, and AWS Neptune if you want to do some more research.
Having this new technology to help with complex relationships seems like it might be a viable option! If you look at marketing for these graph databases, you’ll see usecases such as fraud detection, Language Processing, and other advanced use-cases. In our case, we are just going to pursue this due to highly complex dimension tables. These tables typically hold a lot of technical debt when it comes to maintenance, change data capture (CDC), and future additions of new values. Graph databases solve many of these challenges.
We are currently putting together a Proof of Concept to see how a graph could solve these problems for our complex and sparsely populated dimensional data. Things look promising, but we want to see how this could work at scale. We already know that we cannot put our BI tools directly into these graph databases due to the query language that is used to extract data. Cypher Query Language is the standard for querying Graph databases VS SQL for relational databases. We will probably use our favorite ETL tool Apache Hop to run the Cypher queries, then load to Serverless Redshift on AWS. This will allow us to query directly with Amazon Quicksight for visualization.
We’ll do a follow-up post once we prove some things out. In the meantime, we would love to discuss your data engineering and analytics projects and put our experience to work for you!
Comments