TERRY MORIARTY AND BOB SCHMIDT
Mining for Metadata
May, 1997
Has your company treasurer ever tried to explain futures contracts to you? If so, you probably had to absorb a wealth of information delivered in a "stream of consciousness" and then discern its essence, which is not an easy task. Confronting an existing database or program poses a similar problem: The business you have to understand is hidden in a cloud of jargon specific to the technology in which the application is implemented. Each nugget of business meaning is so slimmed down that "Futures expiration date" becomes FTR-EXPD. Recognizing the essence of the application is also, unfortunately, very difficult.
In a perfect world, each application is fully documented in a browsable repository, in which each application component is fully defined and its linkage to other components easy to navigate. In a perfect world, every application component can be traced back to the business concept and rules that it supports. In a perfect world, the business environment is fully documented through an integrated set of models that describe the business processes and the data those processes use. In a perfect world, new business rules are incorporated into those business models before any changes are made to the supporting business and information technology systems.
Unfortunately, this isn't a perfect world. In most cases these days, neither the business environment nor the applications portfolio is well documented, so the effort to recover that knowledge can be daunting.
REVERSE-ENGINEERING BASICS
Interviewing subject-matter experts and reviewing existing systems are essential parts of any successful business analysis effort. Fortunately, these two tasks can complement each other so that each makes the other easier. However, to understand business systems, analysts need a process for extracting business knowledge from those applications. That process is called reverse engineering.
Reverse engineering is the process of identifying the components of an application (programs and data, for example) and how they're linked. It involves the mining and extraction of business rules enforced by the application. Results of the reverse-engineering process are then stored in a database.
Although automated tools designed to identify application components and interrelationships are available, other metadata, such as business definitions and stewardship responsibilities, must be entered manually. In addition, business rules that cannot be extracted from the code using automated tools must be uncovered manually. The database can also capture the transformation rules between application databases, such as between a data warehouse database and its source-data elements. When business models exist, reverse engineering involves tracing application components back to the business concepts they support. In most cases, this effort is also a manual one.
Given the intensive manual efforts necessary to reverse engineer an application, you may be wondering why your organization would want to initiate this type of project. The reason is relatively simple: Reverse engineering creates a central metadata repository about the application portfolio. When the application portfolio is translated from the language of the database administrator and programmer into that of the data administrator and business analyst, your team can more easily:
* Raise the level of abstraction to merge disparate file systems (such as VSAM, relational, and object systems).
* Normalize disparate metadata (such as alias names, formats, and business definitions).
* Conduct automated impact analysis on an individual application and across the application portfolio.
* Uncover business rules from application code (such as valid values, calculations, default values, and dependencies between data elements).
* Identify unused data.
* Develop metrics on quality and complexity that can be used to help establish priorities and estimate costs.
* Use the new metadata as the basis for forward engineering the application.
* Show all metadata graphically (such as relationships between application components).
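As a rough illustration of automated impact analysis, the sketch below assumes the repository can be reduced to a list of "uses" links between components. The component names and the `impacted_by` helper are hypothetical, invented for this example; a real repository would hold far richer metadata.

```python
from collections import deque

# Hypothetical repository content: each pair means "user depends on used".
DEPENDS_ON = [
    ("PGM-BILLING", "FILE-ACCOUNT"),
    ("PGM-STATEMENT", "FILE-ACCOUNT"),
    ("FILE-ACCOUNT", "ELEM-ACCT-NO"),
    ("FILE-TRADE", "ELEM-FTR-EXPD"),
    ("PGM-SETTLE", "FILE-TRADE"),
]


def impacted_by(component, depends_on=DEPENDS_ON):
    """Return every component that directly or indirectly uses `component`."""
    # Index the links so we can look up the users of any component.
    users = {}
    for user, used in depends_on:
        users.setdefault(used, set()).add(user)

    # Breadth-first walk from the changed component out to its users.
    seen = set()
    queue = deque([component])
    while queue:
        current = queue.popleft()
        for user in users.get(current, ()):
            if user not in seen:
                seen.add(user)
                queue.append(user)
    return sorted(seen)


# Which files and programs are touched if the account number changes?
print(impacted_by("ELEM-ACCT-NO"))
```

A component that appears in the repository but is never on the "used" side of a link, and is not itself a program, is one candidate for the "unused data" list; confirming that still takes manual review.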
Suppose the data for an application is implemented in two different databases. During implementation, last-minute changes to each database design were not documented. After reverse engineering each database into physical models in a single common format, making comparisons can identify how the two database designs have diverged. The same method can be applied when comparing future and current states for projected systems.
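The comparison described above can be sketched in a few lines, assuming each reverse-engineered physical model has already been reduced to a common format of table names, column names, and data types. The dictionary layout, the sample tables, and the `diff_models` function are assumptions made for illustration only.

```python
def diff_models(model_a, model_b):
    """Compare two physical models given as {table: {column: data_type}}."""
    report = []
    for table in sorted(set(model_a) | set(model_b)):
        cols_a = model_a.get(table)
        cols_b = model_b.get(table)
        # A table present in only one design is reported once, as a whole.
        if cols_a is None or cols_b is None:
            missing = "A" if cols_a is None else "B"
            report.append(f"{table}: missing in {missing}")
            continue
        # Otherwise compare column by column; None means "column absent".
        for column in sorted(set(cols_a) | set(cols_b)):
            type_a = cols_a.get(column)
            type_b = cols_b.get(column)
            if type_a != type_b:
                report.append(f"{table}.{column}: A={type_a} B={type_b}")
    return report


# Hypothetical models of the two databases.
db_a = {
    "ACCOUNT": {"ACCT_NO": "CHAR(8)", "OPEN_DT": "DATE"},
    "TRADE": {"TRADE_ID": "INTEGER"},
}
db_b = {
    "ACCOUNT": {"ACCT_NO": "CHAR(10)", "OPEN_DT": "DATE", "BRANCH_CD": "CHAR(4)"},
}

for line in diff_models(db_a, db_b):
    print(line)
```

Passing a current-state model and a proposed future-state model to the same function gives the future-versus-current comparison.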
WHY REVERSE ENGINEERING?
Reverse engineering is not performed for its own sake. Rather, it can be an integral part of other larger projects, such as:
Reengineering an Existing Application. The most obvious justification for reverse engineering is to understand an underdocumented application that's been targeted for reengineering or replacement. In many cases, an application is retired not because the data and business rules it supports are invalid but because technology has improved. For example, the code may be brittle and difficult to maintain; the organization may be migrating to more modern platforms; or the user may want to replace an older, character-screen interface with GUI front ends. In each of these cases, the business rules enforced by the current application must be converted to the new environment. Reverse engineering provides a mechanism for extracting and examining the application's data and business rules so they can be forward engineered, where appropriate, into the new application.
Evaluating Third-Party Packages. Before an application is replaced with an off-the-shelf system, both the application being replaced and the new package can be reverse engineered to give you an idea of the "fit" between the new software and your organization. The very idea of reverse engineering makes some vendors swoon, because they consider the models that underlie their software to be intellectual property. But such vendors should be less concerned about piracy and more concerned for their clients' need to thoroughly evaluate the package's design. After all, a package that supports 90 percent of your organization's functionality but only 90 percent of your data requirements is probably not a good fit.
Sourcing a Data Warehouse. Before sourcing data into a data warehouse, you must fully understand the data held in the source applications, including its design (such as the formats held in the application's database) and the business rules that affect the values the data can assume. Data warehouse users need access to the valid values, derivation rules, and any logic that affects how the data is populated in the source application. Consequently, reverse engineering only those data structures that transmit data to the data warehouse is insufficient; the source application's code must also be analyzed to extract any relevant business rules.
Implementing Mandates That Span the Application Portfolio (such as Year 2000). Occasionally, we have to make changes that affect the entire application portfolio. For example, the effects of expanding a data element that is shared corporatewide, such as account numbers or the organization's business-unit identifier, can ripple across the application portfolio. The Year 2000 initiative, which requires the examination of every date to ensure an application's viability after the century turns, is a good example of this type of global mandate. Reverse engineering is an ideal technique for Year 2000-related projects, because each program must be examined to identify date fields.
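To give a feel for what identifying date fields can look like, here is a minimal sketch that scans a COBOL-style record layout for names that hint at dates. The copybook fragment, the naming hints (DATE, EXPD, -DT, -YY, -YR), and the `find_date_fields` function are all assumptions for illustration; real naming conventions vary by shop.

```python
import re

# Hypothetical copybook fragment; field names are made up for illustration.
COPYBOOK = """\
01  FUTURES-REC.
    05  FTR-CONTRACT-ID    PIC X(10).
    05  FTR-EXPD           PIC 9(6).
    05  FTR-TRADE-DATE     PIC 9(8).
    05  FTR-QTY            PIC 9(5).
    05  FTR-SETTLE-YY      PIC 99.
"""

# Matches elementary items of the form: level-number NAME PIC picture.
FIELD = re.compile(r"^\s*\d{2}\s+([A-Z0-9-]+)\s+PIC\s+(\S+?)\.\s*$")

# Assumed naming hints for date-like fields; adjust to local conventions.
DATE_HINT = re.compile(r"(DATE|EXPD|-DT$|-YY$|-YR$)")


def find_date_fields(copybook):
    """Return (field name, picture clause) for fields whose names hint at dates."""
    hits = []
    for line in copybook.splitlines():
        match = FIELD.match(line)
        if match and DATE_HINT.search(match.group(1)):
            hits.append((match.group(1), match.group(2)))
    return hits


for name, picture in find_date_fields(COPYBOOK):
    print(name, picture)
```

A name-based scan like this only produces candidates. Dates hidden behind uninformative names, or moved between fields in program logic, still have to be traced through the code.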
Incorporating Managed Application Techniques. Organizations managing their applications along an evolutionary management framework such as the Software Engineering Institute (SEI) capability maturity model must migrate the metadata that describes their application portfolio to an automated platform. This platform can control enhancements and measure the effectiveness of an organization's application-management process. To take an application through the SEI stages, its specifications must be moved at some point from text-based documents, code, and database catalogs to an automated environment. Reverse-engineering techniques can support this migration.
PERFECT TIMING
At what phase in a project's life cycle should reverse engineering take place? As soon as you define a project, you can use a rough-cut reverse-engineering analysis to scope and estimate the total effort required to understand the application. If you can't describe the scope of the project in terms of the major files that will be affected, you might need to improve your scope statement. When you can identify those files most likely to be impacted, you can use the reverse-engineered model to enhance order-of-magnitude cost estimates.
The reverse-engineered model is a valuable tool for bringing the analysis team members up to speed about the source application's functionality. You should complete much of the reverse-engineering effort before conducting any interviews or analysis sessions, because when business stakeholders discuss their perception of an application and its data, they tend to repeat a few notable stories over and over. After all, people usually talk about the interesting exception and assume you know the ordinary information or "base case."
The existing system is full of this base-case information, which is often buried in the application's code and database specifications. Therefore, the models you create through reverse engineering will provide the application's base-case data and business rules. By combing the diagrams generated by reverse engineering before interviewing business stakeholders, you will be better prepared to "weed out" interesting, but unhelpful, anecdotes about exceptional experiences. For example, a review of the model might lead to observations and questions such as: "Every product has a price. Do we always charge the same for a given product?" Questions such as this one can be used to structure and focus the interviews.
SOME ADVICE
The reverse-engineered model plays an essential role in validating the target model and populating the target environment. Every relevant item in the source system should be referenced to the target model using one of the data mapping scenarios shown in Table 1. The mapping cardinality is from the source or original field to the target or new field.

Reverse engineering requires the patience of an archeologist; it calls for careful analysts who revel in wading through masses of detail to find a nugget of truth. Each data element must be inspected for relevancy to the target database. Backward-chaining through the source application's code may be necessary to uncover business rules that affect the data's behavior. The map you create between the source and target becomes the mapping table that drives data transformation rules.
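One way to keep track of that source-to-target map is sketched below. The cardinality labels are generic ones chosen for this example and are not necessarily the scenarios listed in Table 1; the field names and the `classify_mappings` function are likewise hypothetical.

```python
from collections import defaultdict


def classify_mappings(pairs, source_fields):
    """Label each source field by how it maps to the target model.

    `pairs` is the mapping table: (source field, target field) tuples.
    `source_fields` is every field found in the source application.
    """
    targets_of = defaultdict(set)
    sources_of = defaultdict(set)
    for source, target in pairs:
        targets_of[source].add(target)
        sources_of[target].add(source)

    result = {}
    for source in source_fields:
        targets = targets_of.get(source, set())
        if not targets:
            # Not yet accounted for in the target environment.
            result[source] = "unmapped"
        elif len(targets) > 1:
            result[source] = "one-to-many"
        elif len(sources_of[next(iter(targets))]) > 1:
            result[source] = "many-to-one"
        else:
            result[source] = "one-to-one"
    return result


# Hypothetical mapping table and source inventory.
pairs = [
    ("CUST-NAME", "customer.first_name"),
    ("CUST-NAME", "customer.last_name"),
    ("FTR-EXPD", "future.expiration_date"),
    ("ADDR-1", "customer.street"),
    ("ADDR-2", "customer.street"),
]
source_fields = ["CUST-NAME", "FTR-EXPD", "ADDR-1", "ADDR-2", "FILLER-01"]

for field, label in classify_mappings(pairs, source_fields).items():
    print(field, label)
```

The "unmapped" entries are the ones to chase: each is either irrelevant to the target and can be documented as such, or a gap in the target model.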
The road to reverse engineering is littered with obstacles (see Table 2). Here are just a few tips to help you avoid them:

Don't "Go Native." If the existing models are the starting point for your reverse-engineering effort, don't look at them too long, or you may end up creating the same models for the target environment. Remember that one of the most valuable qualities you can bring to the project is a fresh perspective. If you're too committed to the status quo, you can't maintain a zeal for originality.
Avoid Fixes for the Source System. If you decide to reengineer or replace the source application, you may get bogged down in implementation issues too early. Every data design has its rough spots, even glaring, gross errors aching to be fixed, but you should avoid getting into deep discussions about replication, distribution, and indexing with the maintenance team. Remember, your objective is to understand the source application's data and business rules, not to fix application problems.
Don't Overanalyze. Playing with diagrams can be seductive. But your objective is neither to create a diagram in which no lines cross nor to normalize the data structures of the source application. Rather, your diagrams and other extracted metadata should have the single purpose of capturing the application's data and business rules and ensuring that each data element in the source application is accounted for in the target environment.
A NEW MODEL
The new data model for the target environment will be different from that of the reverse-engineered source application for several reasons:
* The business may have changed dramatically.
* The previous developers may not have understood data modeling principles.
* The relevant business area may not have been subjected to formal data analysis in the past (such as in most data warehouse efforts, in which business processes are being analyzed for the first time).
Someday, applications will be developed using sound engineering practices. Metadata describing application components will be an essential deliverable in every development project. Systems will track business evolution and how applications changed to meet new practices. But in the meantime, reverse engineering remains essential to understanding and managing the evolution of your application portfolio.
In the next Enterprise View, we'll show you how automated reverse-engineering tools can make this whole process much easier.
Terry Moriarty, president of Inastrol, a San Francisco-based information management consultancy, specializes in customer relationship information and metadata management. Her common business models have been used as the basis of customer models for companies within the financial services, telecommunication, software/hardware technology manufacturing, and retail consumer product industries. You can reach her at terry@inastrol.com.
Bob Schmidt is the founder of agpw inc. and the author of the DataMaster series in data modeling. Nationally recognized as an expert on data, he is a frequent speaker at information technology conferences. You can reach him via e-mail at schmidt@agpw.com.