I know I should have written this article long before but as they say, "better late than never". In this multi part tutorial we will learn the basics of dimensional modeling and we will see how to use this modeling technique in real life scenario. At the end of this tutorial you will become a confident dimensional data modeler.
No previous knowledge in dimensional modeling (or any other modeling) is a prerequisite for this tutorial. However I assume that you already know what is a data warehouse, you have working knowledge in database and preferably you have seen or worked in a data warehousing project before.
Ok, so let's get started.
Dimensional modeling is one of the methods of data modeling, that help us store the data in such a way that it is relatively easy to retrieve the data from the database.
All the modeling techniques give us different ways to store the data. Different ways of storing data gives us different advantages. For example, ER Modeling gives us the advantage of storing data is such a way that there is less redundancy. Dimensional modeling, on the other hand, give us the advantage of storing data in such a fashion that it is easier to retrieve the information from the data once the data is stored in database.
This is the reason why dimensional modeling is used mostly in data warehouses built for reporting. On the other side, dimensional model is not a good solution if your primary purpose of your data modeling is to reduce storage space requirement, reduce redundancy, speed-up loading time etc. Later on the tutorial we will learn why is it so.
Now that we know the reasons behind creating a dimensional modeling, let's find out what exactly is done in this type of models.
In dimensional model, everything is divided in 2 distinct categories - dimension or measures. Anything we try to model, must fit in one of these two categories. So let's say, I want to store information of how many burgers and fries are getting sold per day from a single McDonalds outlet, we will have to first classify this data in dimension and measures. And then we will have 2 different categories of tables (i.e. dimension table and measure table a.k.a fact table to store them).
If you want to understand how to classify data in dimensions and facts in greater detail, please read Classify data for successful modeling
In the following examples we will choose a practical business scenario and see how to identify dimensions and facts to model the scenario
Our objective is to create a data model that can store how many burgers and fries are getting sold from a specific McDonalds outlet per day.
The whole modeling approach is divided in 4 or 5 steps as depicted below.
Dimensions are the object or context. That is - dimensions are the 'things' about which something is being spoken. In the above statement, we are speaking about 3 different things - we are speaking about some "food", some specific McDonalds "store" and some specific "day". So we have 3 dimensions - "food" (e.g. burgers and fries), "store" and "day". Burgers and fries are 2 different members of "food" dimension. We will have to create separate tables for separate dimensions.
Measures are the quantifiable subjects and these are often numeric in nature. In the above statement, the number of burgers/fries sold is a measure. Measures are not stored in the dimension tables. A separate table is created for storing measures. This table is called Fact Table.
Now that we have decided we need 3 tables to store the information of 3 dimensions, next we need to know what are the properties or attributes of each dimension that we need to store in our table. This is important since knowing the properties let us decide what columns are required to be created in each dimension table.
As you might have guessed, each dimension might have number of different properties, but for a given context, not all of them are relevant for us. As an example, let's take the dimension "food". We can think of so many different attributes of food - e.g. names of the food, price of the food, total calories in the food, color of the food and so on. But as I said, we need to check which of these attributes are relevant to us - that is - which of these attributes are required for reporting on this data. As for the given statement above, we just need to know only one attribute of the "food" dimension - i.e. name of the food (burger or fries). So the structure of our food dimension will be rather easy. Like below:
KEY NAME ----------- 1 Burger 2 Fries
Similarly, the structure of our store and day dimensions will be like this:Store
KEY NAME ----------- 1 Store 1 .... .....Day
KEY DAY ----------- 1 01 Jan 2012 2 02 Jan 2012 3 03 Jan 2012 .... ....
As I said, this is really a super simplified structure as we are only interested about basic attribute. But in a complex scenario, we might need to add tens or hundreds on columns to each dimension table if those attributes are required for reporting.
Also note, each dimension table above has a key column.
Key is a not null and unique column which help us identify each record of the table.
I need to explain what is meant by "granularity".
"Granularity" refers to the lowest (or most granular) level of information stored in any table.
Lets take this example, if I say, a specific McDonalds store sells 200 burgers on a specific day and 5000 burgers on a specific month, then in the first case the granularity of my information is daily whereas in the second case the granularity of my information is monthly.
It is important to identify the granularity of the information required. In our case, we need the information on a daily basis. But if my requirement was "To store how many burgers and fries are getting sold from a specific McDonalds outlet per month", required granularity would have changed to monthly from daily. Why is this important? This is important because this information help us decide what columns are required to be stored in our fact table.
For example, since in our case the granularity is food getting sold per store per day, we will need to add key columns from food / store and Day dimensions to the Fact table like below:
If you have followed all the above steps till now, you have then designed 3 dimension tables and 1 fact table. The fact table stores the "number" of food sold in "Quantity" column against a given store, food and day columns. These store/food/day columns are basically foreign key columns of the primary keys in respective dimension tables. The picture of this design is displayed below. This kind of schema is called "Star Schema" because of the star like formation.
The above schema is certainly capable of storing all the information that we intended to store in our dimensional modeling. However, there is a subtle problem. The problem is, we are not sure what would happen if any attribute of any dimension get changed in the future. Let's say McDonalds decided to change the name of the food "burger" to "jumbo burger" for some promotional reason. If they do that, they would update the burger record in the dimension table and update the name to "jumbo burger". So far so good. But the problem is we will lose the old information once they change the name. This means, after they change the name if you look at the data in the model, you will not know that until now the product was called "burger" and not "jumbo burger".
This is a problem if one of the objectives of your modeling is to store history.
Fortunately, this can be solved by designing the dimension tables as "slowly changing dimension". Identify which dimensions are slowly changing (or fast changing or unchanging) is the last and final step of modeling. In the part 3 of this tutorial we will see how to handle these different types of dimensions. But before that, let's continue our discussion on various schema on dimensional modeling next.