Heart of the Warehouse
November, 1996The recent increase in industry competition has spawned increased pressure on IT organizations to provide increased quantities of enterprise data for increased trend and historical analysis of their businesses. However, access to enterprise data is not enough to make a data warehouse environment useful to an organization's knowledge workers. Warehouse users also need a business context for this data if they are to understand its meaning and its significance to the enterprise. They need encoded data, derivation rules, data summarization rules, data dependency rules, and data integration transformation rules-information known as meta-data.
The development of data warehouses has brought with it the need for corresponding meta-data to be available to the business community. Unlike end users of operational environments, who have minimal need to access meta-data, warehouse knowledge workers use meta-data as the primary mechanism for developing their queries and determining warehouse data quality. A data warehouse's effectiveness is directly related to the ease with which its users-both business and technical-can access the meta-data documenting its contents. This meta-data must be developed and maintained, just as the "real" warehouse data is developed and maintained.
The data warehouse functional division framework presented in last month's column provides a context for describing the data warehouse environment's architecture Within this framework, the Meta-data functional division serves as a foundation for the other four divisions: Source, Load, Storage, and Query. A corresponding framework is necessary to describe the architecture for the warehouse's meta-data management environment where warehouse meta-data is created and maintained (see Figure 1). Each architectural component within this framework represents a meta-data source and/or meta-data user. Consequently, each tool incorporated into a data warehouse infrastructure has the potential to create and use meta-data.

Traditionally, application libraries containing program source code and database specifications house meta-data about the operational production systems. Computer-Aided Software Engineering (CASE) tools maintain meta-data about applications under development. The advent of data warehouses has introduced new types of tools within which warehouse-specific meta-data can be created and maintained, such as tools that support data extraction and transformation.
CASE tools still serve as the (entity-relationship and dimensional) modeling arena for data in existing operational production systems and now also serve the same purpose for the target warehouse environment. They document business rules specific to validation, cleansing, extraction, and transformation processes within a warehouse development strategy. CASE tool process-modeling components can provide an arena for transformation specifications, models of existing or future decision-support business functions, and the warehouse development methodology itself Thus, CASE tools can be providers and users of warehouse-relevant meta-data, and interfaces among the tool types-together with a centrally managed repository toolset-can escalate meta-data sharing.
The data warehouse meta-data management environment framework depicts the types of tools within which warehouse-relevant meta-data may be created and maintained. The framework can:
* Identify sources for an inventory of existing warehouse-relevant meta-data
* List the tools an enterprise already owns that can be leveraged for warehouse development initiatives
* Ascertain missing tool functionality for an overall warehouse technical architecture.
A warehouse meta-data management initiative must address the extent to which an organization's current tools are being used to document valuable metadata for a data warehouse.
These meta-data source and/or user tools are categorized according to which one of the following functions they support:
* Provide documentation about the operational systems data and externally obtained data that is the source of the warehouse data
* Support the data warehouse life cycle as defined by the functional divisions Source, Load, Storage, and Query
* Manage meta-data.
Let's examine each of these functions.
DATA DOCUMENTATION
Meta-data about operational legacy systems may exist piecemeal in several types of libraries. The following methods can glean meta-data from these sources:
* Data Definition Language (DDL) may be reverse engineered into models that provide a general representation of the involved data.
* Program code can be scanned and analyzed to produce file and record layouts.
* Job and copybook libraries can be scanned for additional data documentation.
The reverse-engineering and scanning functions available in CASE and repository tools vary in their thoroughness but are generally a good means of gathering meta-data about a warehouse's source data. Some organizations maintain documentation about their operational systems data by committing to logical data models (both application-specific and integrated) within one or more CASE tools as the starting point for all systems-development efforts.
FUNCTIONAL DIVISIONS
The various steps of data warehouse development-extraction cleansing, validation, and transformation-use existing meta-data about data sources and create new meta-data to evaluate and format data for the target warehouse environment. Extract tools, for example, need information on the data's format at its source, the selection criteria, and how it should look after extraction. Cleansing and validation tools need edit and integrity rules about the data to execute their correction and merge routines. Transformation tools need a combination of knowledge about the data at its source and target, edit and integrity rules, and specifications for changing data into the intended warehouse structure and content.
Although extract file layout metadata is useful only to warehouse development personnel, the algorithm for how a piece of warehouse data was calculated is critical to a businessperson's use of the data warehouse.
A few types of tools assist in the access to and ongoing maintenance of warehouse data. Query and reporting tools may contain meta-data components that attempt to bring the pertinent meta-data to a businessperson's desktop side by side with "real" warehouse data. Warehouse management and database and load utility tools provide warehouse development personnel with the substantiation necessary to assess warehouse data usage and warehouse DBMS performance. The tools' meta-data is used to secure access, monitor volume, plan data archiving, and manage change.
META-DATA MANAGEMENT
This repository-based framework assumes that a centrally managed repository tool, or set of tools, is available to integrate, version, and synchronize all warehouserelevant meta-data with its corresponding "real" data and system counterparts. This meta-data must be administrated, secured, and made available to all interested audiences. All of these activities may take some creative development.
For many organizations, developing a data warehouse provides an impetus for integrating data from various sources into a consistent format and structure-a level of integration that was not previously undertaken among separate operational and external systems. This process brings various meta-data to a central storage point where compare and merge functions can help automate the integration process. Impact analysis-undertaken from several starting points and across several types of meta-data-speeds up the decision-making involved in developing a consensus view of warehouse data.
In the past, systems users received minimal amounts of meta-data about an operational system at the time of production implementation (in the form of user documentation manuals or brief field-sensitive online help text). Because production systems rarely changed significantly, metadata documentation rarely changed. However, data warehousing has introduced an environment in which data will continually increase and will be summarized and aggregated according to ever-changing algorithms. Meta-data must keep pace. The warehouse environment requires more meta-data-for example, to substantiate the current algorithms and track how warehouse data was calculated in the past. Meta-data about "real" warehouse data that has become historical is also required and needs versions to the extent that the "real" warehouse data has versions.
User documentation manuals, which are cumbersome to produce and keep up to-date, no longer suffice as a meta-data source for warehouse users; neither do IT oriented mechanisms, to which not all knowledge workers have access, and which have low user comfort levels. Meta-data must now be provided through businessfriendly mechanisms that are current with the warehouse access technology itself, and in time frames and formats that meet business needs. For example, periodic subject area attribute lists or business cycle transformation rules may be needed in report format to facilitate warehouse project signoffs, warehouse usage audits, and other warehouse management milestones.
As additional warehouse data is made available to potential knowledge workers and more knowledge workers need to know what warehouse data is available, it is increasingly important that these knowledge workers have the appropriate metadata access permissions. It is no longer sufficient to make the meta-data available within the same constraints as systems data. Warehouse meta-data administrators must now mirror the administration functionality of traditional production systems.
Meta-data about warehouse data must undergo more extensive synchronization than meta-data for the production phase of operational systems. When warehouse data is replicated to remote locations, corresponding meta-data must also be replicated. When warehouse data is selected and aggregated to satisfy departmental or functional perspectives for a data mart, or when it is given little continuing analytical value and is slated for archival, the relevant meta-data must accompany it. Whatever meta-data work remains must precede production warehouse meta-data management.
Moving and managing meta-data among warehouse development tools is the largest integration issue reflected by the meta-data management framework. For one tool type to reuse the meta-data developed in another tool type, meta-data formats and structures must be able to cooperate across any existing tool interfaces.
A particularly challenging issue is the recognition and timely handling of changes in the data sources that feed the warehouse environment. Proactive recognition and triggering mechanisms will minimize negative impact on the warehouse's Source, Load, and Store functions. How well warehouse tool interfaces and a central repository toolset provide these mechanisms is a significant evaluation criterion and implementation requirement. The meta-data management functions contained in this framework must be implemented for a warehouse environment, whether or not they are provided through a centrally managed repository toolset.
The most common alternative to a repository-based meta-data management architecture involves a pair-wise interface between tools. In this approach, the tool considered to be the primary meta-data source propagates the appropriate metadata directly to those tools that use it. For example, data quality assessment and data transformation tools need access to metadata describing the source applications' data structures. Data transformation tools need access to meta-data about the warehouse's database structure. The data transformation tool may be the "system of record" for data mapping and transformation rules between the source applications and warehouse data. Warehouse users need access to this same meta-data through their query tools. Tools can often extract the meta-data they need from the source tool. For example, data quality assessment and data transformation tools can reverse engineer meta-data about source application data structures, and query tools often have a facility for importing meta-data. Direct interfaces between tools do allow metadata sharing-but at what cost? Tasks such as reverse engineering must be performed redundantly. A tool may be able to interface with only a subset of available tools, leaving warehouse developers to build interfaces to other products using a generic import/export facility. Finally, few tools support version control and impact analysis.
Although the repository-based metadata management environment provides a central clearinghouse for meta-data management, versioning, and sharing, few repository products interface with all available data warehouse tools. Warehouse developers must expect to develop some custom code for migrating meta-data among tools, regardless of which meta-data management strategy they adopt.
The main challenges for effective warehouse meta-data management are choosing which tool should maintain the only "master" copy of a piece of meta-data; limiting (or, at a minimum, coordinating) the tool within which that meta-data will be created; and determining how many tools, if any, need replicated meta-data to satisfy all audiences' requirements. Base these decisions on an understanding of the available tools and their interfaces, comprehensive modeling of the pertinent meta-data, and explicit assignment of each piece of meta-data to a tool in accordance with its create, retrieve, update, and delete action.
The data warehouse meta-data management framework does not suggest that any particular tool is the most appropriate choice for a tool category. Nor does it assume that vendor tools currently exist for every tool category. Some tools contain functionality that spans tool categories within this framework. Likewise, the interfaces among tools have varying levels of completeness and usability. Future columns will address specific tool categories and additional considerations for effective meta-data management within this framework.
REFERENCE
1. Moriarty, T. and R. Greenwood. "Data's Quest-from Source to Query." Database Programming &Design, 9(10): 78-81, October 1996.
Terry Moriarty, president of Inastrol, a San Francisco-based information management consultancy, specializes in customer relationship information and metadata management. Her common business models have been used as the basis of customer models for companies within the financial services, telecommunication, software/hardware technology manufacturing, and retail consumer product industries. You can reach her at terry@inastrol.com.
CHRISTINE MANDRACCHIA is data warehousing methodology developer for Knowledge Partners, Inc. located in New Jersey. She can be reached at 102220.1061@CompuServe.com.