This paper outlines some of the most important (and equally neglected) things that one must consider before and during the design phase of a data warehouse. In our experience, data warehouse designers often miss these items simply because they thought them too trivial to deserve attention. At the end of the day, such neglect costs them heavily, as it cuts into the overall ROI of the data warehouse.
Here we outline some data warehouse gotchas that you should be aware of.
In a top-down design approach, people start by visualizing the end data, so the complexity of data analytics is what they see first. The more detail they see on that side, the more time they devote to designing the analytical or reporting solutions, and the less to the background ETL tasks that deal with data extraction, cleansing, transformation and so on. They often work under the assumption that mapping the source data from the existing systems will be comparatively easy, since users already understand the source systems well, and that the need for (and complexity of) cleansing and profiling the source data will be low, since the data is coming from standard source systems.
Needless to say, these assumptions often prove false when it comes to actually coding the ETL layer that feeds the data warehouse. Almost always, mapping, cleansing and preparing the data turns out to be significantly more time consuming than designing the reporting / analytics layer.
From a budgeting and costing standpoint too, an architect prefers to present the case for reporting and analytics rather than background ETL, as the former is much easier to demonstrate to senior management than the latter when trying to get them to sanction the budget. This leads to a disproportionate budget split between the background ETL and frontend reporting tasks.
Users often do not know what they want from the data until they start to see it. As development progresses and more and more data visualization becomes possible, users start wishing for even more out of their data. This phenomenon is unavoidable, and designers must allocate extra time to accommodate such ad-hoc requirements.
Many requirements that were implicit in the beginning become explicit and indispensable in the later phases of the project. Since you cannot avoid this, make sure you have allocated adequate time for it in your project plan beforehand.
The power of an integrated data warehouse becomes apparent when you start discovering discrepancies and issues in the existing (supposedly stable) source systems. The real problem, however, is that designers often make the wrong assumption that the source or upstream systems are fault-free, and therefore allocate no time or resources in their project plan to deal with such issues.
Data warehouse developers do discover issues in the source systems, and those issues take a lot of time to get fixed. More often than not, the issues are not even fixed at the source (to minimize the impact on the business); instead a workaround is suggested to deal with them directly at the data warehouse level (although that is generally not a good idea). Source system issues confuse everybody and require more administrative time (than technical time) to resolve, as DW developers need to identify each issue and make their case to the source system teams to prove that it really exists. These are huge time wasters and are often not incorporated in the project plan.
Source systems do not always give you correct data. Many validations and checks are not done at the source system level (e.g. in OLTP systems), and each time a validation check is skipped, there is a danger of unexpected data reaching the data warehouse. Therefore, before you can actually process data in the data warehouse, you need to perform some validation checks at your end to ensure the data is what you expect.
This is again unavoidable. If you do not make those checks, you invite issues on your side: data loading errors, reconciliation failures, even data integrity threats. Hence ensure that proper time and resources are allocated to work on these items.
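As a concrete illustration, such pre-load validation checks can be sketched as follows. This is a minimal, hypothetical example; the field names (`customer_id`, `amount`) and rules are assumptions for illustration, not part of any specific system:

```python
# Minimal sketch of pre-load validation checks on an incoming source feed.
# The field names and rules here are hypothetical illustrations.

def validate_feed(rows, required_fields, key_field):
    """Return a list of validation errors found in the incoming rows."""
    errors = []
    seen_keys = set()
    for i, row in enumerate(rows):
        # Check that every mandatory field is present and non-empty.
        for field in required_fields:
            if not row.get(field):
                errors.append(f"row {i}: missing value for '{field}'")
        # Check for duplicate business keys, which would threaten integrity.
        key = row.get(key_field)
        if key in seen_keys:
            errors.append(f"row {i}: duplicate key '{key}'")
        seen_keys.add(key)
    return errors

feed = [
    {"customer_id": "C1", "amount": "100.50"},
    {"customer_id": "C1", "amount": "75.00"},  # duplicate business key
    {"customer_id": "C2", "amount": ""},       # missing amount
]
issues = validate_feed(feed,
                       required_fields=["customer_id", "amount"],
                       key_field="customer_id")
```

In practice such checks would run in the ETL staging area, and any non-empty error list would halt the load (or quarantine the offending rows) before the data reaches the warehouse tables.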
You will face natural resistance from the existing business users, who will show huge inertia against accepting the new system. To ease the transition, adequate user training sessions are generally arranged for the users of the data warehouse. But you will notice that "adequate" training is not "sufficient" for them, mainly because they need to unlearn a lot of things in order to learn the new data warehouse.
Even if you arrange adequate training, you will find that the users do not really put it to use when it comes to working in the new data warehouse. That is often because facts and figures from the new data warehouse challenge their existing convictions, and they are reluctant to accept them wholeheartedly.
User training and acceptance is probably the single most important non-technical challenge that makes or breaks a data warehouse. No matter how much effort you put in as a designer, if the users are not using it, the data warehouse is as good as a failure. As the old Sanskrit saying goes, “a tree is known by the name of its fruit”: the success of a data warehouse is measured by the information it produces. If that information is not relevant to the users, or if they are reluctant to use it, the purpose is lost. Hence make every possible effort to connect with the users and train them to use the data warehouse. Merely 'adequate' training is not 'sufficient' here.
Conflicting business rules are another common surprise. That is because the users belong to different departments of the company, and even though each of them knows the business of her own department pretty well, she may not know the business of the other departments that well. When you take the data from all these departments and try to combine it into an integrated data warehouse, you will often discover that the business rule suggested by one user is completely opposite to the business rule suggested by another.
Such cases are generally involved and need collaboration between multiple parties to reach a conclusion. It is better to consider them early, during the planning phase, to avoid late surprises.
Even a very meticulously done volumetric estimate from the starting phase of the project can go awry later. This happens for several reasons; e.g. a slight change in a standard business metric may have a huge impact on the volumetric estimates.
For example, suppose a company has 1 million customers who are expected to grow at a rate of 7% per annum, and you have used this measure in several places while calculating the volume and size of your data warehouse. Now if the customer base actually increases by 10% instead of 7%, that means 30,000 more customers. In a fact table with customer / product / day granularity, this means 30,000 × 10 × 365 more records (assuming on average one customer uses 10 products). If one record takes 1 KB, the fact table now requires (30,000 × 365 × 10 × 1 KB) / (1024 × 1024) = 100+ GB of additional disk space, from just one table.
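The arithmetic above is easy to verify; the figures below are simply those from the example:

```python
# Back-of-the-envelope check of the volumetric example above.
customers = 1_000_000
planned_growth_pct = 7      # growth rate used in the original estimate
actual_growth_pct = 10      # growth rate actually observed

# Customers beyond the original estimate.
extra_customers = customers * (actual_growth_pct - planned_growth_pct) // 100

products_per_customer = 10  # average products per customer (assumed)
days_per_year = 365
record_size_kb = 1          # assumed size of one fact record

extra_records = extra_customers * products_per_customer * days_per_year
extra_gb = extra_records * record_size_kb / (1024 * 1024)

print(extra_customers)      # 30000
print(round(extra_gb, 1))   # 104.4 -- the "100+ GB" from one table
```

A 3-point swing in one growth assumption thus adds over 100 GB to a single fact table, which is why volumetric estimates should be revisited periodically rather than fixed once at project start.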
When a user looks at one value in your report and says, "I think it's not right", the onus is on you to prove the correctness or validity of that data. Nobody is going to help you prove how right your data warehouse is. For this reason, it is absolutely necessary to build a solid data reconciliation framework for your data warehouse: one that can trigger an early alarm whenever something does not match between source and target, so that you get enough time to investigate (and if required, fix) the issue.
Such a reconciliation framework, however indispensable, is not easy to create. Not only does it require a huge amount of effort and expertise, it also tends to run on the same production server at almost the same time as the production load, eating up a lot of performance. Moreover, such a framework is often not a client-side requirement, making it even more difficult for you to allocate time and budget. But skipping it would be a much bigger mistake.
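To make the idea concrete, here is a minimal sketch of one such reconciliation check: it compares a row count and a control total per table between source and target and raises alarms for any mismatch. The table name, statistics and zero tolerance are purely illustrative assumptions:

```python
# Hedged sketch of a source-vs-target reconciliation check.
# Stats are dicts of {table_name: (row_count, control_total)};
# in a real framework these would be queried from the databases.

def reconcile(source_stats, target_stats, tolerance=0.0):
    """Return a list of alarm messages for tables that do not match."""
    alarms = []
    for table, (src_rows, src_total) in source_stats.items():
        tgt_rows, tgt_total = target_stats.get(table, (0, 0.0))
        # Row counts must match exactly.
        if src_rows != tgt_rows:
            alarms.append(f"{table}: row count mismatch {src_rows} vs {tgt_rows}")
        # Control totals must match within the configured tolerance.
        if abs(src_total - tgt_total) > tolerance:
            alarms.append(f"{table}: control total mismatch {src_total} vs {tgt_total}")
    return alarms

source = {"sales": (1000, 52_000.0)}   # illustrative numbers only
target = {"sales": (998, 51_900.0)}
alarms = reconcile(source, target)
```

Run immediately after each load, an empty alarm list means source and target agree; a non-empty one is the early warning described above, arriving while there is still time to investigate before users see the reports.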
Apart from development and deployment, maintenance also incurs huge costs in data warehousing: server maintenance, software licensing, regular data purging, database maintenance - all of these add up.
It is important to set expectations about these heavy maintenance cost implications at the very beginning of the project.
You need to load data into the data warehouse, generally at least daily (sometimes more frequently), and also in monthly / quarterly / yearly cycles. Loading the latest data into the data warehouse ensures that your reports are up-to-date. However, the time required to load the data (the refresh time) is going to be longer than what you calculated, and it is going to increase day by day.
One of the major hindrances to the acceptance of a data warehouse by its users is its performance. I have seen too many cases where reports generated from the data warehouse missed their SLAs and severely damaged the dependability and credibility of the data warehouse. In fact, I have seen cases where the daily load ran for more than a day and never completed in time to generate the daily report. There have been other famous cases of SLA breach as well.
I cannot stress this enough: performance considerations are hugely important for the success of a data warehouse, and more important than you think. Do everything necessary to make your data warehouse perform well - reduce overhead, maintain servers, cut down complexity, run regular system performance tests (SPT) and weigh the results against industry benchmarks, make SPT a part of user acceptance testing (UAT), and so on.