Data Vault Modelling Use Case

7 min readFeb 21, 2022

What exactly is a data warehouse?

A data warehouse is a central repository or central reservoir where all the different source systems of an enterprise can store data. The data is persistent and its history is maintained. It can be used by business users to perform analytics.

There are usually two modelling techniques:

The third normal form

Dimensional Models

Let us try to understand why these two modelling techniques are prominent when it comes to designing a data warehouse.

What exactly is the 3rd normal form?

The 3rd normal form is a technique used to model operational systems. All business sectors within a company store data in some sort of database. This database, which is used to model operational details, is in third normal form.

Example:

You go to a grocery store or a retail shop, and you pay something at the point of sale to purchase an item. All these operations are stored in the company’s database, which is designed in the 3rd Normal Form.

Data is highly normalised in the 3rd normal form. Linking of tables is done in such a way that multiple joins are performed. The granularity of data is as low as possible, which means faster writing into the database. For running business processes efficiently, data should be tuned for faster writes.

On the other hand, dimensional models have different objectives.

What exactly is dimensional modelling?

Dimensional modelling is aimed at providing business value to everyone who wants to understand their enterprise data and get insights from it. The data model, which is designed for faster reads, is dimensional. Over a period, there were two main theories or pillars that were introduced by the two fathers of data warehousing.

Challenges

With the pace at which data and business are growing, you need to be very agile when it comes to your data warehouse modelling. A data model can’t be rigid. It should be flexible and adaptable. This gap was filled by Data Vault.

What exactly is a data vault?

Data Vault was a concept that was coined by Dan Lin Stead, and he understood the gap that was produced by the rigidity of the 3NF data warehousing technique and dimensional model.

He came up with a hybrid approach that took the good things of 3NF and the good things of dimensional models. And then it provided the kind of agility and flexibility that were missing.

Let’s try to understand the concept through a business workflow and the challenges that can be solved by Data Vault.

How it tries to resolve the particular challenges

Use Case:

Suppose this is a data model for renting a house. The data model tells you the business procedure flow for a new property that is leased, and the particulars are linked together with different tables.

You have an owner who owns the property at a certain location. That property is rented to some tenant, and then that tenant has his details, and the owner also has his details.

Using the 3rd normal form:

This is what it would look like in the 3rd NF. It would be highly normalised.

This is how the 3rd normal form variation works:

Suppose there are parent-child relationships in the data model. Then, any changes that occur to the parent must be passed on to multiple children. And if there is any business process that has changed and that changes the primary key of any particular table.

Here, if it is an owner’s details, it is further related to family details. All these keys have to be changed. Because this particular primary key is working as a foreign key, you need to understand the kind of changes you have to make if you have to introduce any small change.

This can be resolved, but it will not be agile and it will be very time-consuming. It is painful, and it causes a lot of inconsistency.

Challenges

Every table will have its primary key and a date timestamp attached to it.

To understand the uniqueness of the record and to avoid duplicates in a 3rd NF, you always have your primary key + DateTime stamp.

Using a Dimensional Model:

Suppose you want to turn it around and we want to change it to a dimensional model.

We will analyse what can be qualified as a dimension and a fact. Then we will create something like this, which is famously called a star schema.

It will have a single fact table standing in the centre and all the dimensions will be surrounded by it. This will give an aggregated view of what you want to report.

We have made a particular rental fact table and it is surrounded by all the dimensions.

These PKs would reside as foreign keys in the fact table for all the dimensions, and then all the figures that you have to calculate. Suppose you want to understand “how much the overall rent can be collected from a certain city and which is the highest hire earner as a user in this city.” All these questions can be answered with a very effective model, which is a dimensional model.

This particular model is very good at reporting. Suppose you want to create some sort of report and do some analytics. If a business wants to derive input from this data model, then it is very useful.

Scenario

Suppose you have created this data model.

This has gone live in production. And all of a sudden, there is a requirement. Oh, I forgot to tell you that I also wanted to analyse “how many rents we now have in a particular pin code”. So you created this dimension, but this dimension was at the city level, and the data that you contain here is at that grain.

Granularity means at what level your facts will be aggregated against your dimensions.

Let’s assume I also want to understand that if I say that the grain of this data model or this fact table is at a city level. Then I don’t have data to drill down and understand what your total rent or total cost or any other measure is below the city hierarchy. The city will have multiple pin codes. So you have to drill down and you have to go to lower grain, but for that, what are the options?

Challenges

You have to change this fact table, reload it with new grain, which might have the risk of losing all your history in this fact table, or you might have to introduce a different fact table. This fact table would have the same data but at a pin code grain.

We might come up with two fact tables.

You have to hit the same source systems twice, and you have to maintain and make changes to this whole data model, your reports, and everything.

So it is not very agile, it is very painful, and it also takes a lot of time. There is a risk of losing history. So, how do we define a data model, a data warehouse, or a data modelling technique? Why a third NF or dimensional model? And what challenges do we face?

Data Vault

The data model is just restructured as per the data vault architecture concept. So a data vault has three main components, which we have to understand:

Hub: It retains business keys. It is just a field in a new table that enterprises can plainly recognise and connect with as far as the particular data is concerned.

Satellite: Descriptive information about the hub. All of its information is subject to change over time.

Link: Relationship between two hubs.

Scenario

Same use case as we discussed above in dimensional models.

This has gone live in production. And all of a sudden, there is a requirement. Business says: “Oh, I forgot to tell you that I also wanted to analyse “how many people own flats at a particular pin code.” So you created a link, but this was at the city level, and the data that you contain here is at that grain.

In Data Vault, you can easily add one more satellite, which will hold information at a more granular level (i.e., pin code level).

All new keys will be added to the hub. There will be another satellite for this location hub, which will have pin code level information. All your keys will be maintained and your relationship will remain intact.

All relationship changes that happen will be reflected in the link table.

Here we see how easily we can change into a data model in a very fast time.

Now, the data model is scalable, adaptable, and consistent :)

Data Vault Modelling Use Case

What exactly is a data warehouse?

What exactly is the 3rd normal form?

Example:

What exactly is dimensional modelling?

Challenges

What exactly is a data vault?

How it tries to resolve the particular challenges

Use Case:

Using the 3rd normal form:

This is how the 3rd normal form variation works:

Challenges

Using a Dimensional Model:

Scenario

Challenges

Data Vault

Scenario

Written by Prabha Arya