In our earlier article we have discussed the need of storing historical information in dimensional tables. We have also learnt about various types of changing dimensions. In this article we will pick "slowly changing dimension" only and learn in detail about various types of slowly changing dimensions and how to design them.
Slowly changing dimensions, referred as SCD henceforth, can be modeled basically in 3 different ways based on whether we want to store full histories, partial histories or no history. These different types are called Type 2, Type 3 and Type 1 respectively. Next we will learn them in detail.
As mentioned above, we design a dimension as SCD type 1 when we do not want to store the history. That is, whenever some values are modified in the attributes, we just want to update the old values with the new values and we do not care about storing the previous history.
We do not store any history in SCD Type 1
Please mind, this is not same as "Unchanged Dimension" discussed in the previous article. In case of an unchanged dimension, we assume that the values of the attributes of that dimension will not change at all. On the other hand, here in case of a SCD Type 1 dimension, we assume that the values of the attributes will change slowly, however, we are not interested to store those changes. We are only interested to store the current or latest value. So every time it changes we will update the old value with new ones.
Technically, from ETL design perspective (Now, if you don't know what is ETL, you don't have to bother about this paragraph - you can go to the next section) SCD Type 1 dimensions are loaded using "Merge" operation which is also known as "UPSERT" as an abbreviation of "Update else Insert".
SCD Type 1 dimensions are loaded by Merge operations
In "UPSERT" method, each row coming from the source is compared will all the records present in the target dimension table based on the natural key and checked if the source record already exists in the target or not. If the row exists in the target, the target row is updated with new values coming from source system. However if the row is not present in the target system, the source row is inserted in the target table.
In pure ANSI SQL syntax, there is a particular statement that help you achieve the UPSERT operation. It's called "MERGE" statement
MERGE INTO Target_Dimension_Table tgt USING source_table src ON tgt.natural_key = src.natural_key WHEN MATCHED THEN UPDATE SET tgt.column1 = src.value1, tgt.column2 = src.value2, ... WHEN NOT MATCHED THEN INSERT (tgt.column1 , tgt.column2 ...) VALUES (src.value1 , src.value2 ...
As obvious from this example, you have to store the natural key of the data in the target dimension table in order to perform this comparison. Later, I will write a separate article on ETL architecture design, where I will talk about this in more detail. But from a modeling perspective, please note that as a data modeler you should add one extra column in your target dimension table as a place holder to store the natural key of the data.
Arguably, this is the most popular type of slowly changing dimensions. So we will try to learn this as clearly as possible.
Let me come one step backward here and remind you again about what is our objective here. As you can recall, in the previous articles we have learnt how the values of the attributes (or columns) in the dimension table change with time. We are trying to store the histories of such changes for the purpose of analysis.
In Type 1, we were not storing any history. However, now we are going to learn how may we design a dimension table so that we can store the full history and always extract the history of changes as and when we require that. We will take our "Food" dimension table as an example here, where "Price" is a variable factor.
KEY NAME TYPE_KEY PRICE 1 Chicken Burger 1 3.70 2 Veggie Burger 1 3.20 3 French Fries 2 2.00 4 Twister Fries 2 2.20
In order to design the above table as SCD Type 2, we will have to add 3 more columns in this table, "Date From", "Date To" and "Latest Flag". These columns are called type 2 metadata columns. See below:
KEY NAME TYPE_KEY PRICE DATE_FROM DATE_TO Latest_FLG 1 Chicken Burger 1 3.70 01-Jan-11 31-Dec-99 Y 2 Veggie Burger 1 3.20 01-Jan-11 31-Dec-99 Y 3 French Fries 2 2.00 01-Jan-11 31-Dec-99 Y 4 Twister Fries 2 2.20 01-Jan-11 31-Dec-99 Y
Notice here, how the values of these 3 new columns are populated. In the very beginning, when any new record is loaded in the table, we automatically default the values of "date from" to the date of the day of the loading, "Date To" to some far future date (e.g., 31st December 2099) and "Latest Flag" to "Y".
What is the meaning of these 3 metadata columns?
These 3 columns basically tell us whether a particular record in the table is latest or not and what is the time period during which the record was latest (Also known as active period). For example, data in the above table basically says that all the 4 records are latest (active) and they are active from the day of loading (in this case 1st January 2011) until an indefinite future date (31st December 2099).
But how does these columns help us store the change history?
Lets assume, today is 15 March 2011, and McDonald has decided to increase the price of "Veggie Burger" from $3.20 to $3.25. If this happens we will not straight away update the price from $3.20 to $3.25. Instead to store this new information (and also the old information), we will insert a new record in the "Food" dimension table which will look like below:
KEY NAME TYPE_KEY PRICE DATE_FROM DATE_TO Latest_FLG 1 Chicken Burger 1 3.70 01-Jan-11 31-Dec-99 Y 2 Veggie Burger 1 3.20 01-Jan-11 14-Mar-11 N 3 French Fries 2 2.00 01-Jan-11 31-Dec-99 Y 4 Twister Fries 2 2.20 01-Jan-11 31-Dec-99 Y 5 Veggie Burger 1 3.25 15-Mar-11 14-Mar-11 Y
Observe the change in the records with Key 2 and 5. Record 2, which was the original record for the veggie burger, has now got updated as its latest flag has become 'N' and "Date To" column value has changed to "14-Mar-2011". This means, Record 2 is no longer latest or active (Latest Flag = "N") and it was active earlier during the period 1st Jan 2011 (Date From) to 14 Mar 2011 (Date To).
So, if Record 2 is not active, what is the latest record for "Veggie Burger" now? Record 5! Its latest flag is set to "Y" and it says that that the record is active since 15 March 2011.
This record will remain active many years in the far-off future (until 31 Dec 2099) or at least unless a new record is inserted again with latest flag Y and this record is updated again with Latest Flag N. So next time again, let's say on 20 Dec 2011, McDonalds again decide to change the price of Veggie Burger back to $3.20 and increase the price of the chicken burger from $3.70 to $3.90, we will see 2 more new records in the table as below:
KEY NAME TYPE_KEY PRICE DATE_FROM DATE_TO Latest_FLG 1 Chicken Burger 1 3.70 01-Jan-11 19-Dec-11 N 2 Veggie Burger 1 3.20 01-Jan-11 14-Mar-11 N 3 French Fries 2 2.00 01-Jan-11 31-Dec-99 Y 4 Twister Fries 2 2.20 01-Jan-11 31-Dec-99 Y 5 Veggie Burger 1 3.25 15-Mar-11 19-Dec-11 N 6 Chicken Burger 1 3.80 20-Dec-11 31-Dec-99 Y 7 Veggie Burger 1 3.20 20-Dec-11 31-Dec-99 Y
As you can see from the design above, it is now possible to go back to any date in the history and figure out what was the value of the "Price" attribute of "Food" dimension at that point in time.
Note from the above example that, each time we generate a new row in the dimension table, we also assign a new key to the record. This is the key that flows down to the fact table in a typical Star schema design. The value of this key, that is the numbers like 1, 2, 3, …. , 7 etc. are not coming from the source systems. Instead those numbers are just like sequential running numbers which are generated automatically at the time of inserting these records. These numbers are unique, so as to uniquely identify each record in the table, and are called "Surrogate Key" of the table.
As obvious, multiple surrogate keys may be related to the same item, however, each key will relate to one particular state of that item in time. In the above example, keys 2, 5 and 7 are all linked to "Veggie Burger" but they represent the state of the record in 3 different time spans. It's worth noting that there would be only one record with latest flag = "Y" among multiple records of the same item.
A slight variation of design of SCD Type 2 dimension is possible where we can store the version numbers of the records. The initial record will be called version 1 and as and when new records are generated, we will increment the version number by 1. In this design pattern, the records with highest version will always be the latest record. If we utilize this design in our earlier example, the dimension table will look like this:
KEY NAME TYPE_KEY PRICE DATE_FROM DATE_TO Version 1 Chicken Burger 1 3.70 01-Jan-11 19-Dec-11 1 2 Veggie Burger 1 3.20 01-Jan-11 14-Mar-11 1 3 French Fries 2 2.00 01-Jan-11 31-Dec-99 1 4 Twister Fries 2 2.20 01-Jan-11 31-Dec-99 1 5 Veggie Burger 1 3.25 15-Mar-11 19-Dec-11 2 6 Chicken Burger 1 3.80 20-Dec-11 31-Dec-99 2 7 Veggie Burger 1 3.20 20-Dec-11 31-Dec-99 3
Off course, we can also keep the "Latest Flag" column in the above table if we wish.
Again, if you do not know what is ETL - you can safely skip this section. But if you have some ETL background then I suppose you have already pin-pointed the fact that, unlike SCD Type 1, Type 2 requires you to insert new records in the table as and when any attribute changes. This is obviously different from SCD Type 1. Because in case of SCD Type 1, we were only updating the record. But here, we will need to update old record (e.g. changing the latest flag from "Y" to "N", updating the "Date To") as well as we will need to insert a new record.
Like before, we can use the "natural key" to first compare if the source record is existing in the target or not. If not, we will simply insert the record in the target with new surrogate key. But if it already exists in the target, we will have to check if any value of the attributes has changed between source and target - if not, we can ignore the source record. But if yes, we will have to update the existing record as "N" and insert a new record with new surrogate key. As I mentioned before, I will write a separate article on the ETL handling later.
SCD type 2, by design, tend to increase the volume of the dimension tables considerably. Think of this: Let's say you have an "employee" dimension table which you have designed as SCD Type 2. The employee dimensions has 20 different attributes and there are 10 attributes in this table which change at least once in a year on average (e.g. employee grade, manager's name, department, salary, band, designation etc.). This means if you have 1,000 employees in your company, at the end of just one year, you are going to get 10,000 records in this dimension table (i.e. assuming on an average 10 attributes change per year - resulting into 10 different rows in the dimension table).
As you can see, this is not a very good thing performance wise as this can considerably slow down loading of your fact table as you will require to "look up" this dimension table during your fact loading. One may argue that, even if we have 10,000 records, we will actually have only 1,000 records with Latest_Flag = 'Y' and since we will only lookup records with Latest_Flag = 'Y', the performance will not detoriate. This is not entirely true. While utilizing the Latest_Flag = 'Y' filter may decrease the size of the lookup cache, but database will generally need to do a full table scan (FTS) to identify latest records. Moreover, in many cases ETL developer will not be able to make use of Latest_Flag = 'Y' column if the transactional records do not always belong to the latest time (e.g. late arriving fact records or loading fact table at later point in time - month end load / week end load etc.). In those cases, putting latest_flag = 'Y' filter will be functionally incorrect as you should determine the correct return key on the basis of "Date To", "Date From" columns. (If you do not understand what I am talking about in this para, just ignore me for now. I am going to explain these things later in some other article)
As I mentioned before, type 3 design is used to store partial history. Although theoretically it is possible to use the type 3 design to store full history, that would be not possible practically. So, what is type 3 design? In Type 2 design above, we have seen that whenever the values of the attributes change, we insert new rows to the table. In case of type 3, however, we add new column to the table to store the history.
So let's say, we have a table where we have 2 column initially - "Key" and "attribute".
KEY ATTRIBUTE 1 A 2 B 3 C
If the record 1 changes its attribute from A to D, we will add one extra column to the table to store this change.
KEY ATTRIBUTE ATTRIBUTE_OLD 1 D A 2 B 3 C
If the record again change attribute values, we will again have to add columns to store the history of the changes
KEY ATTRIBUTE ATTRIBUTE_OLD ATTRIBUTE_OLD_1 1 E D A 2 B 3 C
Isn't then SCD Type 3 very cumbersome?
As you can see, storing the history in terms of changing the structure of the table in this way is quite cumbersome and after the attributes are changed a few times the table will become unnecessarily big and fat and difficult to manage. But that does not mean SCD Type 3 design methodology is completely unusable. In fact, it is quite usable in a particular circumstance - where we just need to store the partial history information.
Let's think about a special circumstance where we only need to know the "current value" and "previous value" of an attribute. That is, even though the value of that attribute may change numerous times, at any time we are only concerned about its current and previous values. In such circumstances, we can design the table as type 3 and keep only 2 columns - "current value" and "previous value" like below.
KEY Current_Value Previous_Value 1 D A 2 B 3 C
I can't find a very good example of this scenario right away, however, I can give you one example from one of my previous projects in telecom domain, wherein a certain calculated field in the report used to depend on the latest and previous values of the customer status. That calculated attribute was called "Churn Indicator" (churn in telecom business generally means leaving a telephone connection) and the rule to populate the churn indicator was (in a very very simplified way) like below:
Churn Indicator = "Voluntary Churn" (if customer's current status = 'Inactive' and previous status = 'Active') = "Involuntary Churn", (if customer's current status = 'Inactive' and previous status = 'Suspended')
As you can guess, in order to find out the correct value of churn indicator, you do not need to know complete history of changes of customer's status. All you need to know is the current and previous status. In this kind of partial history scenario, SCD Type 3 design is very useful.
Note here, compared to SCD Type 2, type 3 does not increase the number of records in the table thereby easing out performance concerns.
Now that we have already learnt about slowly changing dimensions, next we will discuss how to design "Rapidly Changing Dimension" or RCD