Practical Data Architecture

Random thoughts on data modeling and database design, by Bruce Worobec

Archive for August, 2008

Building a Data Warehouse Step-by-Step

Posted by Bruce Worobec on August 13, 2008

If it were only that easy.  Yes, I will outline the steps necessary to build out a Data Warehouse.  But that is only part of the story.  Depending on your definition of failure, many, if not most, Data Warehouse efforts do not make the grade.  Why is that?

 

I have boiled it down to two factors:  Organizational Will and Organizational Maturity.  Absent either of the two, a Data Warehouse effort will likely not succeed.  Throughout the presentation of the steps I will intersperse checkpoints against these two factors.  If your organization cannot pass the checkpoints I would recommend against executing the steps.

 

One popular alternative is to build independent Data Marts.  A key mantra in Data Warehousing is “a single source of the truth.”  The independent Data Mart efforts will each come to their own conclusion of what a Product is, or what a Customer is.  “My Mart says we have 1258 Customers.”  “Well, my Mart says we have 1301 Customers.”  Who is right?  Who knows?  Factor in the time spent attempting to correlate results across two or more of these independently built Marts and the sum of the ROI of the individual efforts will be eroded, if not turned negative.
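To make the disagreement concrete, here is a minimal sketch of two marts counting Customers from the same source rows.  The records, field names, and both definitions of a Customer are invented for illustration; real marts usually disagree in subtler ways.

```python
# Hypothetical source records; fields and values are invented for illustration.
accounts = [
    {"account_id": "A1", "bill_to": "Acme Corp", "status": "active"},
    {"account_id": "A2", "bill_to": "Acme Corp", "status": "active"},
    {"account_id": "A3", "bill_to": "Bolt Ltd", "status": "closed"},
    {"account_id": "A4", "bill_to": "Cask Inc", "status": "active"},
]


def sales_mart_customer_count(rows):
    """Sales mart (assumed rule): a Customer is any account, open or closed."""
    return len({r["account_id"] for r in rows})


def finance_mart_customer_count(rows):
    """Finance mart (assumed rule): a Customer is a distinct bill-to party with an active account."""
    return len({r["bill_to"] for r in rows if r["status"] == "active"})


print(sales_mart_customer_count(accounts))    # 4
print(finance_mart_customer_count(accounts))  # 2
```

Neither function is wrong.  Each simply encodes a definition that nobody agreed on across the business, which is the whole problem.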

 

If you are not sufficiently discouraged, here are the steps to build an integrated Data Warehouse:

 

  1. Build an Enterprise Conceptual Data Model
  2. Identify potential Facts and Dimensions
  3. Initiate a Data Stewardship Program
  4. Determine an iterative implementation plan for Facts and Dimensions
  5. For each iteration:

·        Identify data sources for each Fact and Dimension

·        Develop ETL to populate the Facts and Dimensions

  6. Develop a meaningful set of reports, cubes and dashboards based on the data available under a coordinated Business Intelligence (BI) program

 

Let’s take a look at each step in more detail.  Rather than reinventing the wheel here, I have included inline links to relevant articles.

 

Build an Enterprise Conceptual Data Model

 

Many practitioners recommend against this step.  I think this comes from past experiences where typical Data Architects have tried to develop too perfect a model.  A comprehensive model takes too long and frustrates the business experts participating in the exercise.  A quick but skillfully executed modeling exercise will galvanize the executive team behind the warehouse effort and will save rework and false steps down the road.  Give me a half dozen key executives and a similar number of operational experts from across the business willing to sit through 4 or 5 facilitated sessions spread out over a few weeks, with each session lasting 3-4 hours, and I will produce a conceptual model sufficient to guide a Data Warehouse deployment.

 

Checkpoint—Organizational Maturity:  Are the key players identified for the modeling exercise able to step out of their individual roles and agree on the data entities, definitions, and key performance indicators necessary for the future needs of the enterprise?  Even if it means recognizing current operational data and process shortcomings in their area of responsibility?  If the answer is no then there are bigger problems than not having a Data Warehouse that need to be fixed first.

 

Checkpoint—Organizational Will:  Yes, I am asking for a chunk of time from the key players in an organization, not to mention the scheduling nightmares.  The sessions are best done offsite to minimize distractions and do leave your Crackberry at the door.  The CEO typically needs to step in and ensure this is among everyone’s top priorities.  Can’t make it happen—forget about an integrated Warehouse that meets expectations.

 

Identify potential Facts and Dimensions

 

With the Conceptual Data Model in hand, the key Facts and Dimensions needed in the warehouse will jump off the page.  Common entities with relationships across the model, such as Customer or Product, are obvious Dimensions.  Look for transactional entities like Order Line Item or Support Incident and you will find key Facts.
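As a rough illustration, the Customer and Product entities become Dimension tables and Order Line Item becomes a Fact table that points at them.  The sketch below uses Python's built-in sqlite3 module; the table names, column names, and sample rows are all assumptions made up for the example, not a recommended physical design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (
    customer_key  INTEGER PRIMARY KEY,
    customer_name TEXT NOT NULL
);
CREATE TABLE dim_product (
    product_key  INTEGER PRIMARY KEY,
    product_name TEXT NOT NULL
);
-- The Fact holds the measures plus a key into each Dimension.
CREATE TABLE fact_order_line (
    customer_key INTEGER NOT NULL REFERENCES dim_customer (customer_key),
    product_key  INTEGER NOT NULL REFERENCES dim_product (product_key),
    quantity     INTEGER NOT NULL,
    amount       REAL NOT NULL
);
INSERT INTO dim_customer VALUES (1, 'Acme Corp'), (2, 'Bolt Ltd');
INSERT INTO dim_product VALUES (1, 'Widget'), (2, 'Gadget');
INSERT INTO fact_order_line VALUES (1, 1, 10, 100.0), (1, 2, 1, 25.0), (2, 1, 5, 50.0);
""")

# Aggregate the Fact by an attribute of one of its Dimensions.
revenue_by_customer = conn.execute("""
    SELECT c.customer_name, SUM(f.amount)
    FROM fact_order_line AS f
    JOIN dim_customer AS c ON c.customer_key = f.customer_key
    GROUP BY c.customer_name
    ORDER BY c.customer_name
""").fetchall()
print(revenue_by_customer)  # [('Acme Corp', 125.0), ('Bolt Ltd', 50.0)]
```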

 

Initiate a Data Stewardship Program

 

Checkpoint—Organizational Will and Maturity:  Data Stewardship is not easy.  It foists responsibility on functional areas of the business and insists on rigorous processes at the point when new data is created.  If anyone can create a new Product, or Customer, or Geography, based on their own definition, then no Warehouse could ever consistently count or aggregate based on these entities.

 

One of the often-cited reasons for building a Data Warehouse is consistency.  Consistency in a Warehouse requires Conformed Dimensions and Facts.  Without Data Stewardship, there is little chance of implementing Conformed Dimensions or Facts.  And without Conformed Dimensions and Facts, there is no chance of ever getting to a single source of the truth.  Does this sound important?  It is essential.
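A small sketch of what conforming buys you: two Facts from different subject areas that share one Customer Dimension can be reported side by side without any reconciliation.  The keys, names, and numbers here are invented for illustration.

```python
from collections import Counter

# One shared (conformed) Customer dimension: surrogate key -> agreed-upon name.
dim_customer = {1: "Acme Corp", 2: "Bolt Ltd"}

# Two facts from different subject areas, both keyed to dim_customer.
fact_orders = [(1, 100.0), (1, 25.0), (2, 50.0)]  # (customer_key, amount)
fact_incidents = [(1,), (2,), (2,)]               # (customer_key,)

revenue = Counter()
for customer_key, amount in fact_orders:
    revenue[customer_key] += amount

incidents = Counter(customer_key for (customer_key,) in fact_incidents)

# Because both facts use the same keys, the results line up row for row.
report = [
    (name, revenue[key], incidents[key])
    for key, name in sorted(dim_customer.items())
]
print(report)  # [('Acme Corp', 125.0, 1), ('Bolt Ltd', 50.0, 2)]
```

If each Fact carried its own private notion of Customer, the last step would need a mapping table and a meeting.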

 

A side note:  Earlier I mentioned an alternative approach of building a series of independent Data Marts that was ripe for inconsistency.  Some practitioners suggest that the independent Data Marts can be built with Conformed Dimensions and Facts.  While that does solve the consistency problem, I find it a semantic curiosity.  What is a Data Warehouse if not simply a series of Data Marts implemented with Conformed Dimensions and Facts?  If you feel compelled to set me straight—please don’t.  You’re the same guy or gal that wants to build Conceptual Data Models that are too complex.

 

Determine an iterative implementation plan for Facts and Dimensions

 

There is no “Chicken or the Egg” question in Data Warehousing.  First you bring in the data, and then you can produce reports.  As each report is produced, you will learn more about the data.  This learning may influence what data to bring in next or cause you to take another look at data already in the Warehouse—it is an evolution.

 

If I try to bring in all possible data all at once it will take a long time until I see my first report.  And that first report may uncover that my understanding of the source data was flawed.  If I bring in data a little at a time I will produce some really uninteresting reports.  Either way, I risk losing critical momentum.  The answer is to carefully plan a series of iterations where each remains manageable but yet still delivers some incremental value to the business.

 

File this under not biting off more than you can chew without biting off so little that you go hungry.

 

For each iteration:  Identify data sources for each Fact and Dimension

 

Sometimes called Source System Analysis or Source-to-Target Mapping, this is where you figure out where to go and get the data needed to populate the Warehouse.
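One lightweight way to record the result is to keep the mapping as data rather than burying it in a spreadsheet.  This is only a sketch; every system, table, and column name below is hypothetical.

```python
# Hypothetical mapping: every system, table, and column name here is invented.
SOURCE_TO_TARGET = [
    {"target": "dim_customer.customer_name", "source": "crm.accounts.acct_name", "rule": "trim whitespace"},
    {"target": "dim_customer.region", "source": "crm.accounts.sales_region", "rule": "upper-case"},
    {"target": "fact_order_line.amount", "source": "erp.order_lines.ext_price", "rule": "copy as-is"},
]


def sources_for(target_table):
    """Return the distinct source tables that feed one warehouse table."""
    tables = set()
    for row in SOURCE_TO_TARGET:
        if row["target"].split(".")[0] == target_table:
            system, table, _column = row["source"].split(".")
            tables.add(f"{system}.{table}")
    return sorted(tables)


print(sources_for("dim_customer"))     # ['crm.accounts']
print(sources_for("fact_order_line"))  # ['erp.order_lines']
```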

 

For each iteration:  Develop ETL to populate the Facts and Dimensions

 

This is where you write the code that grabs the data from its source and populates the Warehouse.  During design and development the analysts involved are going to have questions—lots of questions.  The answers they get need to be consistent with the definitions in the Conceptual Data Model and content and quality guidelines being developed by the Data Stewards.

 

Checkpoint—Organizational Will and Maturity:  Subject matter experts from the business will get tired of answering questions.  They will have to help untangle anomalies in transactional data they never imagined existed.  Sometimes the best solution will be to fix the transactional data.  Other times, a decision will need to be made that accuracy at the margin will be sacrificed.  Is the business willing to partner in the analysis and accept the limitations necessary for a Warehouse?

 

Another side note:  The series of Data Marts approach is unlikely to optimize ETL.  Each time I write ETL code for a new Data Mart I may be going back to the same source data needed in an earlier Mart.  This new code may implement slightly different rules or logic.  A design principle for a planned Data Warehouse would be to go to each source system once with a single stream of ETL, thereby improving processing efficiency and eliminating the potential for logic inconsistencies.
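Here is a toy version of the single-stream idea: read the source once, apply the cleansing rules in one place, and load both the Dimension and the Fact from the same cleaned rows.  The source rows and the cleansing rule are assumptions for illustration; real ETL would add error handling, change detection, and a lot more.

```python
def extract_orders():
    """Stand-in for one read of the source system (rows are invented)."""
    return [
        {"customer": " acme corp ", "amount": "100.00"},
        {"customer": "ACME CORP", "amount": "25.50"},
        {"customer": "Bolt Ltd", "amount": "50.00"},
    ]


def clean(row):
    """The one place where cleansing rules live (assumed rule: trim and title-case names)."""
    return {
        "customer": row["customer"].strip().title(),
        "amount": float(row["amount"]),
    }


def load(rows):
    """Populate a customer dimension and an order fact from the same cleaned rows."""
    customer_keys = {}
    facts = []
    for row in rows:
        # Reuse the existing surrogate key, or assign the next one.
        key = customer_keys.setdefault(row["customer"], len(customer_keys) + 1)
        facts.append((key, row["amount"]))
    return customer_keys, facts


dim_customer, fact_orders = load(clean(r) for r in extract_orders())
print(dim_customer)  # {'Acme Corp': 1, 'Bolt Ltd': 2}
print(fact_orders)   # [(1, 100.0), (1, 25.5), (2, 50.0)]
```

With separate Marts, `clean` tends to get rewritten for each one, and the copies drift apart.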

 

Develop a meaningful set of reports, cubes and dashboards based on the data available under a coordinated Business Intelligence (BI) program

 

For the first couple of iterations these initial deliverables will usually not be that impressive.  Interesting reporting results require bringing data from multiple sources together.  It may take several development iterations to build the critical mass of data necessary to answer the really good questions.

 

Checkpoint—Organizational Will:  Depending on the quality and quantity of development resources available for each iteration, building the Data Warehouse will take time.  The first iteration will show some ROI, but would be tough to justify on its own.  This is especially true if you add on the up front modeling exercise.  Will the organization be willing to fund the ongoing Warehouse effort as building a long-term asset?  Or will the effort be treated like an expensed project subject to continual justification and possible termination?

 

Yeah. Yeah. Yeah.  That is all well and good.  But when do I get my reports?

Posted in Data Modeling, Data Warehouse

Email Marketing Modeling Session Results

Posted by Bruce Worobec on August 12, 2008

Here is a sample PowerPoint deck showing the typical results of about 8 hours of Business and Data Modeling sessions following my usual approach.

email-marketing-modeling-session-results

Posted in Data Modeling, Real World Examples

Case Study: West Coast Bargain Books

Posted by Bruce Worobec on August 12, 2008

Early in 2004, West Coast Bargain Books was looking for someone to upgrade a number of homegrown MS Access databases to SQL Server to solve some performance issues.  They thought it would be about a two-week effort.  While validating the scope of the project, it quickly became apparent that the performance issues were related to more fundamental problems.  Since no data modeling had ever been done, the multiple Access databases were inconsistent and suffered from major design flaws.  The recommendation was to not upgrade the existing system to SQL Server since few of the current problems would be solved.

Faced with a flawed system and no upgrade path, they asked what they could do.  The philosophy of Worobec Consulting has always been to buy something off the shelf if it meets your needs.  If no system exists in the marketplace and significant competitive advantage can be gained by building a customized solution, then go for it.  In either case, build or buy, you need a couple of high-level models to help make the right decisions early in the project.

We conducted a two-day facilitated session to produce a Business Function Model (BFM) and a Conceptual Data Model (CDM).  These two models are necessary whether you are building or buying.  If buying, the BFM can serve as a functional checklist to evaluate potential systems and the CDM can be compared to the database design to determine whether all data identified in the modeling session is supported.  If building, the two models help to define scope and form the basis of the conceptual design.  The two models produced during the session are shown below.

Business Function Model

Main Function:  Buy and Sell Books and Related Media Products

     Buy books
          Find vendor
          Determine books to carry
          Evaluate / negotiate a deal
          Place a purchase order
          Arrange shipping

     Receive books
          Create space / location
          Sort books
          Locate books
          Record physical attributes (remainder mark, condition, etc.)

     Offer books for sale
          Allocate books to a channel (reserved or not)
          Price books (subjective or calculated)
          Generate list / sample / show

     Receive an order
          Record
               Who
               What
               Where to send
               How to send
               When to send
          Meet customer needs (no peanuts, sticker books, etc.)
          Approve credit / payment

     Fulfill an order
          Generate pull sheet
          Pull books
          Prep books
          Pack books
          Determine freight requirements
          Finalize invoice

     Collect payment

     Process returns and claims

Conceptual Data Model

http://bruceworobec.wordpress.com/2008/08/12/case-study-west-coast-bargain-books/wcbb/

Armed with the results of the modeling session, WCBB went off to search the marketplace for a suitable system.  The systems they found either did not fully meet their needs or were too expensive.  After much debate the decision was to build a custom solution.  SQL Server was chosen as the database and Access was chosen as the front end to leverage the expertise they had in-house.  The system went live in January 2006.  The finished system took approximately 600 hours of Worobec Consulting time and probably twice that number of hours spent by their internal resources.  An additional project objective was to train their staff to perform ongoing support and database administration of the system.

Posted in Data Modeling, Database Design, Real World Examples

A Guide for Participants of Facilitated Sessions

Posted by Bruce Worobec on August 12, 2008

 

What follows is a document I have been using since the early 1990s.  The terminology may need some updating, but for the most part the core techniques have proven themselves timeless.

 

WHAT IS A FACILITATED SESSION?

 

In many ways a facilitated session is simply a meeting.  It involves a group of people brought together to accomplish a specific agenda.  However, unlike a typical meeting, a facilitated session utilizes an individual trained in techniques for conducting such sessions in order to ensure completion of the agenda.  Also, a facilitated session is longer than most meetings, often with a duration of several days.

 

Facilitated sessions can be used for many purposes, from strategic planning to building consensus among a group in conflict.  When used for data modeling, the approach followed during the session is based on the Information Engineering approach to systems development.  In this approach, systems are developed based on several models.  Of all the models, the two of interest at the beginning of a project are the Business Function Model and the Conceptual Data Model.

 

The reason these two models are important is that the Business Function Model serves as the basis for application functional design while the Conceptual Data Model serves as the basis for database design.  Also, through the building of these models, the project will have the potential to gain efficiency by identifying data and processes that have crept into systems over time but are unrelated to true business functions. 

 

The first model to be produced during the facilitated session will be the Business Function Model.  This model will capture “what” your business does in plain English, but will ignore “how” things are done.  The distinction between what and how will be a theme repeated throughout the session.  By focusing on what the business needs to do, the group will not be limited by any individual’s perception of how things work today or how things ought to work in the future.

 

Once the Business Function Model is mostly complete, the focus will turn towards producing the Conceptual Data Model.  The technique used is to ask the following question for each of the lowest level functions in the Business Function Model:  “What data is needed to perform this function?”   The results are captured in an Entity-Relationship (E-R) diagram.  The E-R diagram, along with text descriptions for the entities, attributes, and relationships, then forms the completed Conceptual Data Model.

 

Because the data questions will usually uncover some changes or additions to the Business Function Model, considerable time can be spent refining both the models simultaneously.  Once complete, the two models can then be used as tools to address scope issues, examine technical alternatives, and to assure common understanding between Business Experts and Technical Experts from the beginning of a project.  The models also provide a solid foundation for physical data design.

 

PARTICIPANT ROLES

 

In order for a facilitated session to be successful, the participants all must understand and adhere to their respective roles in the process.  The roles of Business Expert, Technical Expert, and Facilitator are defined as follows:

 

Business Expert:  The Business Expert participants are the experts for the part of the business under analysis.  They must have the knowledge and authority to define what the business needs to do in order to operate efficiently.  Their responses during the session will impact database and process design.  If questions are raised that cannot be answered with the expertise in the room, a Business Expert will be assigned responsibility for researching the issue within their organization.  Most importantly, a Business Expert must not withhold information because they assume the resulting technical implementation would not be feasible.

 

Technical Expert:  The Technical Expert participants are mainly observers during the modeling process.  They should only offer clarifying comments, not proposed solutions.  Even if the information gathered points towards an implementation that is not technically feasible, the Technical Experts must hold their comments.  Once the models are complete, the Technical Experts can then address scope and feasibility issues and explore technical alternatives in conjunction with the Business Experts.

 

Facilitator:  The Facilitator is the leader of the session and is ultimately responsible for delivering what the session set out to accomplish.  Unlike a neutral “Consensus” facilitator, the Facilitator role is biased towards modeling objectives.  The Facilitator will influence the format, but not the content, of the deliverables so that the data modeling objectives can be met and the subsequent physical data designs can be optimized.

 

GUIDELINES FOR BUSINESS FUNCTION MODELING

 

The business function modeling part of the session will be conducted using a set of guidelines designed to keep the session focused on delivering a quality Business Function Model.  The guidelines that follow will be explained in more detail at the beginning of the session:

 

     Functions define WHAT, not HOW

     Functions start with a precise verb

     Functions are defined in plain English–no jargon

     Functions deliver something

     Order of functions is not important

     Functions may be optional

     Functions usually break into 4-8 sub-functions

     Sub-functions must completely define the parent function
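For readers who think in code, the 4-8 guideline above can be checked mechanically once a model is written down as a tree.  The sketch below borrows a few function names from the West Coast Bargain Books case study, but the nesting shown is an assumed reading of that list and the checker itself is purely illustrative.

```python
# A fragment of a function model as nested dicts: name -> sub-functions.
# The grouping is an assumption made for this example.
model = {
    "Buy books": {
        "Find vendor": {},
        "Determine books to carry": {},
        "Evaluate / negotiate a deal": {},
        "Place a purchase order": {},
        "Arrange shipping": {},
    },
    "Offer books for sale": {
        "Allocate books to a channel": {},
        "Price books": {},
        "Generate list / sample / show": {},
    },
}


def flag_unusual_breakdowns(functions, low=4, high=8):
    """Return names of non-leaf functions whose sub-function count falls outside low..high."""
    flagged = []
    for name, subs in functions.items():
        if subs and not low <= len(subs) <= high:
            flagged.append(name)
        # Recurse so deeper levels of the model are checked too.
        flagged.extend(flag_unusual_breakdowns(subs, low, high))
    return flagged


print(flag_unusual_breakdowns(model))  # ['Offer books for sale']
```

A flagged function is a prompt for discussion, not an error.  The guideline says "usually," and a function with three sub-functions may be perfectly complete.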


GUIDELINES FOR DATA MODELING

 

The Data Modeling part of the session will be conducted using a set of guidelines designed to ensure delivery of a quality Data Model.  Depending on the experience of the group, a brief training session on data modeling will be included.  Also, some new terminology will be introduced which is briefly defined below:

 

Entity:

Something the business has the will and means to keep information about

Attribute:

A single piece of information about an entity

Relationship:

An association of two entities for a business purpose

 

Some of the guidelines that will be further explained at the beginning of the logical data modeling are as follows:

 

     Entities must have a unique identifier

     All entities, attributes, and relationships should be fully documented

     The model should only contain data needed to support the business functions

     The underlying relational model should be in third normal form
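As a quick illustration of the last guideline, consider an attribute that describes the Customer but is stored on every Order.  That is a transitive dependency, and third normal form moves it to the entity it actually describes.  The entities, attributes, and rows below are invented for the example.

```python
# Flat rows: customer_city depends on customer_id, not on order_id,
# so storing it per order breaks third normal form.
orders_flat = [
    {"order_id": 1, "customer_id": "C1", "customer_city": "Portland"},
    {"order_id": 2, "customer_id": "C1", "customer_city": "Portland"},
    {"order_id": 3, "customer_id": "C2", "customer_city": "Seattle"},
]


def normalize(rows):
    """Split flat rows into a Customer entity and an Order entity, each with its own identifier."""
    customers = {}
    orders = []
    for row in rows:
        # The city is recorded once per customer, keyed by the customer's unique identifier.
        customers[row["customer_id"]] = {"customer_city": row["customer_city"]}
        orders.append({"order_id": row["order_id"], "customer_id": row["customer_id"]})
    return customers, orders


customers, orders = normalize(orders_flat)
print(customers)  # {'C1': {'customer_city': 'Portland'}, 'C2': {'customer_city': 'Seattle'}}
print(orders[0])  # {'order_id': 1, 'customer_id': 'C1'}
```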

 

FACILITIES

 

Because of the duration and intensity of the sessions, comfortable surroundings are essential for success.  Also, the session will require the complete and undivided attention of all participants.  In order to accomplish this, a location away from the normal workplace is preferred.  The ideal session location would contain the following:

 

     Lots of whiteboard space

     Easels with lots of paper, marking pens, and masking tape

     A laptop computer with word processing software

     Toys and frequent snacks

     People in casual clothing

 

FINAL CAVEATS

 

Depending upon which consultants, methodologies, and tools you have been exposed to, the definitions and notations used in the session may vary from what you have seen in the past.  Experience has shown that the choice of definitions or notations has little impact on the quality of the session.  If in doubt, please ask for clarification so that all participants can be working from the same definitions.  Also, be prepared for times when things are going well, and other times when things fall flat.  It will be critical that everyone police themselves with regard to placing their own needs and biases on hold for the good of the session.  Most of all, in some sort of bizarre way, you will need to view the session as fun.

Posted in Real World Examples

Please allow me to introduce myself…

Posted by Bruce Worobec on August 12, 2008

I have spent the bulk of my twenty-year career building data models and designing databases.  My slant is towards the practical–if you are looking to debate the merits of various methodologies and technologies you’ve come to the wrong place.  Over time I will be sharing ideas and examples of what is necessary and sufficient for successful implementations and throwing rocks at esoteric fluff.

Posted in Start Here