If it were only that easy. Yes, I will outline the steps necessary to build out a Data Warehouse. But that is only part of the story. Depending on your definition of failure, many, if not most Data Warehouse efforts do not make the grade. Why is that?
I have boiled it down to two factors: Organization Will and Organizational Maturity. Absent either of the two, a Data Warehouse effort will likely not succeed. Throughout the presentation of the steps I will intersperse checkpoints against these two factors. If your organization cannot pass the checkpoints I would recommend against executing the steps.
One popular alternative is to build independent Data Marts. A key mantra in Data Warehousing is “a single source of the truth.” The independent Data Mart efforts will each come to their own conclusion of what is a Product, or what is a Customer is? “My Mart says we have 1258 Customers.” Well, my Mart says we have 1301 Customers.” Who is right? Who knows? Factor in the time spent attempting to correlate results across two or more of these independently built Marts and the sum of the ROI of the individual efforts will be eroded, if not turned negative.
If you are not sufficiently discouraged, here are the steps to build an integrated Data Warehouse:
- Build an Enterprise Conceptual Data Model
- Identify potential Facts and Dimensions
- Initiate a Data Stewardship Program
- Determine an iterative implementation plan for Facts and Dimensions
- For each iteration:
· Identify data sources for each Fact and Dimension
· Develop ETL to populate the Facts and Dimensions
- Develop a meaningful set of reports, cubes and dashboards based on the data available under a coordinated Business Intelligence (BI) program
Let’s take a look at each step in more detail. Rather than reinventing the wheel here, I have included inline links to relevant articles.
Build an Enterprise Conceptual Data Model
Many practitioners recommend against this step. I think this comes from past experiences where typical Data Architects have tried to develop too perfect a model. A comprehensive model takes too long and frustrates the business experts participating in the exercise. A quick but skillfully-executed modeling exercise will galvanize the executive team behind the warehouse effort and will save rework and false steps down the road. Give me a half dozen key executives and a similar number of operational experts from across the business willing to sit through 4 or 5 facilitated sessions spread out over a few weeks, with each session lasting 3-4 hours, and I will produce a conceptual mode sufficient to guide a data Warehouse deployment.
Checkpoint—Organizational Maturity: Are the key players identified for the modeling exercise able to step out of their individual roles and agree on the data entities, definitions, and key performance indicators necessary for the future needs of the enterprise? Even if it means recognizing current operational data and process shortcomings in their area of responsibility? If the answer is no then there are bigger problems than not having a Data Warehouse that need to be fixed first.
Checkpoint—Organizational Will: Yes, I am asking for a chunk of time from the key players in an organization, not to mention the scheduling nightmares. The sessions are best done offsite to minimize distractions and do leave your Crackberry at the door. The CEO typically needs to step in and ensure this is among everyone’s top priorities. Can’t make it happen—forget about an integrated Warehouse that meets expectations.
Identify potential Facts and Dimensions
With the Conceptual Data Model in hand, the key Facts and Dimensions needed in the warehouse will jump off the page. Common entities with relationships across the model, such as Customer or Product are obvious Dimensions. And look for transactional entities like Order Line Item or Support Incident and you will find key Facts.
Initiate a Data Stewardship Program
Checkpoint—Organizational Will and Maturity: Data Stewardship is not easy. It foists responsibility on functional areas of the business and insists on rigorous processes at the point when new data is created. If anyone can create a new Product, or Customer, or Geography, based on their own definition, then no Warehouse could ever consistently count or aggregate based on these entities.
One of the often-cited reasons for building a Data Warehouse is consistency. Consistency in a Warehouse requires Conformed Dimensions and Facts. Without Data Stewardship, there is little chance of implementing Conformed Dimensions or Facts. And without Conformed Dimensions and Facts, there is no chance of ever getting to a single source of the truth. Does this sound important? It is essential.
A side note: Earlier I mentioned an alternative approach of building a series of independent Data Marts that was ripe for inconsistency. Some practitioners suggest that the independent Data Marts can be built with Conformed Dimensions and Facts. While that does solve the consistency problem, I find it a semantic curiosity. What is a Data Warehouse if not simply a series of Data Marts implemented with Conformed Dimensions and Facts? If you feel compelled to set me straight—please don’t. You’re the same guy or gal that wants to build Conceptual Data Models that are too complex.
Determine an iterative implementation plan for Facts and Dimensions
There is no “Chicken or the Egg” question in Data Warehousing. First you bring in the data, and then you can produce reports. As each report is produced, you will learn more about the data. This learning may influence what data to bring in next or cause you to take another look at data already in the Warehouse—it is an evolution.
If I try to bring in all possible data all at once it will take a long time until I see my first report. And that first report may uncover that my understanding of the source data was flawed. If I bring in data a little at a time I will produce some really uninteresting reports. Either way, I risk losing critical momentum. The answer is to carefully plan a series of iterations where each remains manageable but yet still delivers some incremental value to the business.
File this under not biting of more than you can chew without biting off so little that you go hungry.
For each iteration: Identify data sources for each Fact and Dimension
Sometimes called Source System Analysis or Source-to-Target Mapping, this is where you figure out where to go and get the data needed to populate the Warehouse.
For each iteration: Develop ETL to populate the Facts and Dimensions
This is where you write the code that grabs the data from its source and populate the Warehouse. During design and development the analysts involved are going to have questions—lots of questions. The answers they get need to be consistent with the definitions in the Conceptual Data Model and content and quality guidelines being developed by the Data Stewards.
Checkpoint—Organizational Will and Maturity: Subject matter experts from the business will get tired of answering questions. They will have to help untangle anomalies in transactional data they never imagined existed. Sometimes the best solution will be to fix the transactional data. Other times, a decision will need to be made that accuracy at the margin will be sacrificed. Is the business willing to partner in the analysis and accept the limitations necessary for a Warehouse?
Another side note: The series of Data Marts approach is unlikely to optimize ETL. Each time I write ETL code for a new Data Mart I may be going back to the same source data needed in an earlier Mart. This new code may implement slightly different rules or logic. A design principle for a planned Data Warehouse would be to go to each source system once with a single stream of ETL, thereby improving processing efficiency and eliminating the potential for logic inconsistencies.
Develop a meaningful set of reports, cubes and dashboards based on the data available under a coordinated Business Intelligence (BI) program
For the first couple of iterations these initial deliverables will usually not be that impressive. Interesting reporting results require bringing data from multiple sources together. It may take several development iterations to build the critical mass of data necessary to answer the really good questions
Checkpoint—Organizational Will: Depending on the quality and quantity of development resources available for each iteration, building the Data Warehouse will take time. The first iteration will show some ROI, but would be tough to justify on its own. This is especially true if you add on the up front modeling exercise. Will the organization be willing to fund the ongoing Warehouse effort as building a long-term asset? Or will the effort be treated like an expensed project subject to continual justification and possible termination?
Yeah. Yeah. Yeah. That is all well and good. But when do I get my reports?