T E R R Y M 0 R I A R T Y A N D
A R T M 0 0 R E
Data Warehouse Must-Reads
December, 1996
Long time readers of this column know that near the holiday season I like to break from my regular column to explore some of the year's best data management books. This time Art Moore, my colleague at Spectrum Technologies Group, will join me in reviewing what we consider to be three of the industry's most important reads: Ralph Kimball's The Data Warehouse Toolkit, Rob Mattison's Data Warehousing, and Barry Devlin's Data Warehouse, From Architecture to Implementation.
If you've been perusing the book stores, you can't miss the explosion of data warehouse books that have hit the shelves recently. These volumes represent a new generation of data warehouse books. From the first generation of warehousing books, led by Bill Inmon and his coauthors, we learned the theories and architectures that raised our awareness of data warehouses. They gave us excellent introductions from which we could develop the business cases used to justify the initial investment .in data warehouse technology. The new generation of data warehouse books is founded on knowledge gained from implementing the first wave of data warehouses. A sense of "lessons learned" pervades each book, warning readers of potential pitfalls and hazards. In step with warehouse technology's maturation, these books emphasize analysis and design techniques, such as star models, data marts, and metadata management, that have emerged as useful in implementing a successful data warehouse project.
Although all these new data warehouse books have merit, we've chosen to give you a brief review of our favorite three.
Ralph Kimball delivers exactly what the title to his book promises: a toolkit for building data warehouses. This book is not only about what must be done to architect a data warehouse; it also provides practical advice on how to model and build one. Kimball begins by listing six goals of a data warehouse:
1. To provide access to corporate or organizational data
2. To provide consistent data
3. To allow the data to be separated and combined by means of every possible measure in the business
4. To comprise not just data but also a set of tools to query, analyze, and present information
5. To serve as a place to publish used data
6. To provide quality data that will drive business reengineering.
These goals, which set the book's context, are reinforced with every case study, example, and design principle Kimball presents. Most of the business rationale for a data warehouse business case can be extracted from this book, using these essential data warehouse goals.
Kimball states that 60 percent of a data warehouse effort focuses on the "back room" components of the central warehouse hardware, relational DBMS software, and the data. The remaining 40 percent of the effort focuses on "front room" components-the end user-oriented tools used to query, analyze, and present the data. The book follows these same proportions: Approximately 60 percent addresses data management issues, involving data modeling and database performance and sizing issues; about 40 percent explores the life cycle of a warehousing project and the user side of data warehousing. This portion of the book covers requirements for OLAP tools and provides scenarios that illustrate the types of business analysis problems these products must support.
Kimball is a master of "knowledge transfer by example." Rather than describing a concept with words and a few limited example diagrams, he develops comprehensive case studies through which to convey his knowledge of the subject to the reader. These case studies are drawn from his own experiences with several different industries. These studies serve as stepping stones, each building upon the design knowledgebase laid out in the previous one. To gain the fill benefit of the concepts Kimball describes, readers should go through the case studies sequentially.
The journey begins with a simple business analysis of the effectiveness of sales promotions by tracking how items at a grocery store are selling. This simple data warehouse design requires four dimensions (promotion, product, store, and time) and one fact table (sales). The case illustrates the basic steps necessary to design any data warehouse:
1. Choose the business process to be modeled
2. Choose the "grain" of the business process
3. Choose the relevant dimensions
4. Choose the "measured facts" that will populate the fact table.
The "grain" represents the fact table at its atomic level and is usually expressed in terms of time. For example, the grain of the grocery store promotion data warehouse is daily item movement, which represents detailed store sales per product per promotion per day. The grain is essential in determining the database's size and the dimensions required to support the end user's analysis.
Each of Kimball's case studies focuses on the particular aspect of a business the data warehouse is intended to address. The grocery store case, for example, explores the rules for aggregating business measures within the fact table. Kimball describes how some measures are fully additive across dimensions, semiadditive across some but not all of the dimensions, or nonadditive.
A warehouse inventory case study addresses the use of snapshot data to support inventory control and the incorporation of transaction-level data into a data warehouse. It also discusses the need for "degenerate dimensions," which provide a key for grouping line items together. Examples of degenerate dimensions include purchase orders, invoices, and bills of lading to which the business assigns a unique identifier to the entire transaction and groups the individual line items.
A shipment-processing case study provides an example of what Kimball considers the most powerful type of data warehouse. This warehouse lets users view all of a company's products, customers, and deals at once, providing some of the most essential financial and customer satisfaction measures.
The grocery store, warehouse, and shipment data warehouse examples are used to illustrate the "value chain" that exists between the demand and supply sides of business. The value chain can be represented by separate fact tables with shared dimensions. Kimball emphasizes that problems can occur if the dimension tables are not identical-for example, if manufacturing and retail use different product definitions for the same product.
Another case study from the financial service industry introduces the concepts of large dimensions, dirty dimensions, and heterogeneous products. A "dirty dimension" is one in which data quality cannot be guaranteed. For example, in most banks, account-oriented source applications contain data about the same customer multiple times. Many banks attempt to derive a "customer" by matching names and addresses across account applications, but this process results in more than one entry for each bank customer. Similarly, different attributes must be held for each of a bank's heterogeneous products. Attributes that are meaningful for a loan, such as term, credit risk assessment, and collateral, have no meaning for savings, checking, or investment products.
Kimball follows each case study with a database sizing based on some typical dimensions and fact table sizes he has encountered in developing real data warehouses for real companies. For many of the case studies he provides algorithms for deriving standard business metrics, such as gross profit, gross margin, days supply, and gross margin return on investment.
Kimball explains that "building a dimensional data warehouse is a process of matching the needs of the user community to the realities of the available data." His requirements-gathering approach involves comparing user expectations to source data reality through a series of interviews with end users and source application data experts. One section, entitled "The Content of End User Interviews," is alone worth the price of the book. It lists questions that can be used to focus and probe the business needs a data warehouse must support. Kimball provides a rationale for each question and suggests various follow-up questions to suit a number of user responses.
In a chapter on front-end applications, Kimball reaches a number of significant conclusions. He maintains that:
* SQL is inadequate to meet the types of analysis that data warehouse users must perform. (In fact, Kimball suggests extensions to SQL throughout the book.)
* End users and their tools should never see the actual data warehouse tables. Instead, an aggregate navigator must insulate the user presentation from database changes that can occur when aggregation dimensions are added to improve performance.
* Because of the complexities of using SQL to produce even the simplest business reports, warehouse users should not develop their own ad hoc reports. Instead, the data warehouse team should include MBA types with technical backgrounds to serve as liaisons with the business community in developing standard, precanned reports that can be customized and initiated with a few mouse clicks.
Throughout the text, Kimball summarizes the salient points of each topic discussed into design principles. Appendix A provides a "Cliffs Notes"-type version of these principles. A CD packaged with the book provides Star Tracker (an OLAP tool developed by the author) and populated Access tables for all the sample data warehouse databases presented in the text.
This book is incredibly hot-a mustread for anyone responsible for modeling and designing a data warehouse.
Ralph Kimball, The Data Warehouse Toolkit: Practical Techniques for Building Dimensional Data Warehouses. John Wiley & Sons, 1996.
In his 1996 text, Data Warehousing Strategies, Technologies and Techniques, Rob Mattison has truly covered the subject. His treatise is comprehensive and pragmatic. Be advised that it's not light reading if you truly want to learn from it-but it's well worth the effort. Mattison takes the reader, in some detail, from project justification and scoping (he identifies step 1 as determining whether you really even need a data warehouse) to front-end applications development.
The book begins with an incisive expose of the systems-development industry's current state of complexity and uncertainty and the project estimation and execution practices in vogue. His section on the "Immutable Laws of Systems Development" is an extremely eloquent presentation of some basic IS development truths on a level with Frederick Brooks' The Mythical Man-Month. Because Mattison avoids none of the harsh realities and complexities of data warehouse development in today's business and systems environment, he provides readers with their best chance at success. He lays out all the gory details that must be considered, explains how they interrelate, and suggests how to address them.
To boost his audience's courage, Mattison takes a layered approach, leading readers bit by bit, with increasing detail, through all aspects of the total data warehouse development effort. He begins with a concise overview of what comprises a data warehouse, what goes into building one, and how and why he suggests you do so.
Mattison describes a data warehouse, at its highest level, as an infrastructure of both technical and operational layers upon which a set of applications is deployed. In turn, each application is "a contiguous assembly of Acquisition, Storage, and Access components [that], when taken in their entirety, provide the user with a functional set of usable data."
Each application is the result of a validated value proposition. A value proposition is a single, specific business problem with one identified sponsor (organization or person) and defined tangible benefits. I won't spoil it for you by saying too much more about Mattison's approach. Suffice it to say that incremental implementation is at the heart of his battle-scarred philosophy, and he makes a strong case for it.
After providing an overview, Mattison spirals back through each data warehouse component and process in greater detail. He breaks his coverage into discussions of operational infrastructure, physical infrastructure, applications development, and what he refers to as "data disciplines, which include data identification, sourcing, synchronization, transformation, validation, and data modeling. He discusses the latter only briefly, which may surprise you; however, this is a book about the total data warehouse from planning to implementation and support, not about how best to model the data.
After dragging readers through the trenches, Mattison returns to his overall warehouse planning model. Here he lays out his value proposition-based approach in detail and provides additional insights on validating, estimating, and ensuring appropriate sponsorship and focus for the data warehouse. He also gives advice on budgeting, bidding, and staffing.
This book is certainly not a simple read. In fact, as Mattison spirals through different aspects of the data warehousing project life cycle, you may find yourself needing to backpedal and regroup in order to keep the thread. Nevertheless, the content is valuable and, as Mattison himself says, executing a successful data warehousing project isn't easy. What Mattison offers is an in-depth analysis of the problem domain and the various components you must consider and shape into your own warehousing project. His incremental value-proposition approach offers a solid, well-presented framework for proceeding with the overall effort. Whether or not you fully agree with Mattison's approach, his insights for establishing clear sponsorship, scoping, and business benefits for data warehouse initiatives are nothing less than invaluable.
Oh and by the way, if that's not enough, the entire last third of the book is devoted to various aspects of data mining. These chapters, each written by a different contributing author, cover topics from multidimensional analysis to neural nets and data visualization.
Rob Mattison, Data Warehousing: Strategies, Technologies and Techniques. McGraw-Hill, 1996.
Barry Devlin's Data warehouse, From ,architecture to Application is one of the latest data warehouse books to hit the market. On the surface, its structure is similar to many of the other data warehouse books currently available: history, architecture, project life cycle. What makes this book different is its emphasis on data management issues from an enterprisewide perspective. Devlin deals with issues surrounding the distribution of consistent and integrated data (including metadata) to a degree that I have not encountered in any other data warehouse book. He provides a data administrator's view of the data warehouse.
Devlin gives a detailed description of a conceptual data warehouse architecture based on three data layers: real-time, reconciled, and derived. Operational systems are the creators and primary users of realtime data. Derived data includes copies of operational data as well as any data derived from that snapshot data. Derived data is what we commonly expect to store in the data warehouse databases. However, Devlin argues that a third layer of data must exist if we are to manage the data resource properly. This layer, called reconciled data, exists to support the integration of disparate data from operational systems.
To achieve the benefits associated with the proposed data warehouse architecture, you must adopt an enterprisewide view of data, which you can only achieve through a coordinated approach to data modeling. Devlin proposes several levels of modeling. A generic entity-relationship model serves as the basis for logical application views. Logical application views are then transformed into physical data designs. Data model levels are mapped to the data warehouse architecture layers.
Although other data warehouse experts warn against using a data warehouse project as an avenue for developing an enterprise model, I've always wondered how a successful data warehouse environment can be developed without that corporatewide view. I was happy to hear Devlin voice the view that a "data warehouse design is a particularly appropriate area for the use of enterprise-level modeling, because both the data view and the data scope of the enterprise modeling are consistent with the needs of the warehouse."
I also agree with his definition of an enterprise data model as "a consistent definition of all the data common to the business, from a high-level business view to the physical database design." He provides a brief explanation of the data classifications that are incorporated into the generic entity-relationship model, and he includes a type hierarchy for "Involved Party" and "Product." This classification scheme illustrates how a general concept (such as Product) can provide an infrastructure through which an organization's business (loan products, for example) is specified. Devlin writes: "This commonality must be recognized at a high level in the business to ensure that data at lower levels in the model [is] correctly related." The generic data classifications are consistent with the metalevel models that are essential if organizations are to integrate their disparate operational data into data that can be used to determine the health of the organization.
Information resource management professionals have a strong ally in Barry Devlin. His book is a strong weapon in our fight to prevent the proliferation of stovepipe data warehouses throughout our organizations.
Barry Devlin, Data Warehouse, From Architecture to Implementation Addison-Wesley, 1996.
Terry Moriarty, president of Inastrol, a San Francisco-based information management consultancy, specializes in customer relationship information and metadata management. Her common business models have been used as the basis of customer models for companies within the financial services, telecommunication, software/hardware technology manufacturing, and retail consumer product industries. You can reach her at terry@inastrol.com.
Art Moore is a senior consultant and manager of national accounts for Spectrum Technology Group Inc. He can be reached at AMoore@Spectrumtech.com.