Terry Moriarty and Suzi Hellwege
Criteria to help you decide if a packaged migration tool fits your organization
Evaluating Data Transformation Tools
March, 1998
Whether to make or buy your migration tool is the first key issue you'll face when designing your enterprise's data migration facilities. What mechanisms will you use to perform the actual migration of the data from source applications to target databases? If you decide to consider commercially available products, you'll need criteria to test each tool's suitability for your environment. This column provides a summary of criteria we've used in the past to evaluate data migration products. While data migration includes extract, transformation, and load processes, we'll focus here on evaluation criteria for data transformation and leave the discussion of evaluating the other two processes for a future column.
DATA TRANSFORMATION
One of the first things to determine about data transformation support is the tool's ability to define, read, scan, and extract data from your source application. Some of the primary compatibility areas to look at are the program language that the tool uses and supports (such as Cobol, C, or C++), the data definition languages (such as DB2, Oracle, IMS, or IDMS), and the tool's ability to import data structure from either a DBMS catalog or a CASE or repository product. Also, if the tool is not able to read in all your data structures, does it give you the option of entering the data structure definitions manually?
Once data structure specifications have been determined, the next thing to review is how the transformation rules are specified. Each tool will have the ability to support a number of automated data type conversions. Determine what types of transformations the tool can support and the method with which it provides this support. Will it adequately support the data type conversions your target technology requires? Some of the more common required transformations are:
* Straight moves (move source field A to target field B)
* Temporary fields (calculated fields held throughout processing, such as counters or running balances)
* Calculations (arithmetic data conversions, using both target and temporary data-for example, Target = (2>( SourceB)/ (SourceC-TempE))
* Statistical calculations (TargeA=AVE(B), where AVE is the average function and B is the domain within the source field for which the average function is performed)
* Summary calculations
* Table lookups (Targe~=Name(B), where Name is the index for an external table or file and B is the name of the field within that file)
* Conditional transformations (transformations based on Boolean logic, for example: If A = X, then B C, else B = 0)
* Concatenation (Phone Number = Area Code + Prefix + Line Number)
* Substring operations (for example, Area Code = Phone Number (1,3))
* Data type conversions.
If the tool doesn't handle all your data type conversions, look at the capabilities it provides for supporting user exits, and look at windows where it can be exited and where users can write code to support nonstandard transformations. Be sure to check that these user exits support your organization's programming language of choice.
Also look at how the tool lets you enter and maintain the transformation rules themselves. Does the application support graphical data mapping? Will users be able to perform point-and-click data entry? Look at the tool's ability to provide mapping assistance and shortcuts such as autolinking data elements with common names or searching for a name within data already entered. What other mapping assistance does the tool provide?
As a part of your data transformation, you will need to define record selection within the tool for one or many source systems. A major consideration in this area is the tool's ability to capture selection logic to process a subset of the source data. Can you use Boolean logic to define this subset? Does the tool support breakpoint logic for processing, and what calculations does it let you perform within these breakpoints (for example, statistical calculation and reporting)?
If your transformation will be mapping from multiple sources, data integration and synchronization functionality are critical. Be certain that your tool has the ability to process multiple source files concurrently and match records representing the same business instance across those files (for example, when data for the same customer exists in multiple source files). Can the tool construct a single record or set of related records from the records that were matched across multiple source files? If an error is detected in the synchronization process, such as a missing record in one or more source files, can the tool detect this condition? If so, what options does it give users to identif~ and correct this condition (stop processing, report error, continue processing)?
As you move through your transformation process, you will require various levels of summarization and reporting. Identify the levels of summarization available within the tool. Are these features built in, or must users develop them? Does the tool provide sufficient summary functions (such as average, mean, and total) and statistical functions (such as standard deviation) for your data analysis needs?
Look carefully at the tool's standard reporting functionality. Can you pull out information at the data element, target-column, and source-field level? For each transformation rule, can the tool list source field, target column, and processing specification? When an error is detected or an unmapped condition is encountered, can the tool clearly identify the condition to the user for evaluation and possible correction? If reporting is provided, where in the process does this occur~as the condition is discovered or at the end of processing? Can users report on data entered into the tool as well as data calculated throughout the transformation process (such as temporary fields, calculations, and summarization totals)? Again, if the tool doesn't support the required reporting, will it allow users to exit and define their own reports? If so, check the programming language required for this report definition.
Finally, review the tool's execution specification. Can the tool report on the execution order of related transformation rules (such as all the transformation rules that are included within one pro-gram/compilation unit)? Which runtime environment does the tool support, and does the tool require that its metadata repository be available in the execution environment? Does the tool generate code for the transformation rules?
CHOOSING A MIGRATION APPROACH
Two categories of data migration products exist:
* First-generation products-tools that generate programs in a programming language, such as Cobol or C
* Second-generation products-rule-based tools.
Rule-based products actively use the transformation rules coded in their metadata repository in the operational environment. The value of this category of tools is that they allow you to change easily and implement transformation rules. You don't have to generate, compile, and test code outside the tool. However, you must carefully consider the ramifications of using such tools, as they may require a runtime component that has access to the metadata repository. Furthermore, you must consider whether your organization's position with respect to "open architectures" restricts the use of tools that are considered to be "closed" because they use their own proprietary database or do not provide a migration path to a more conventional development platform.
Should you decide that a second-generation data migration product is suitable for your organization, carefully evaluate the environment the product provides for controlling changes to the data transformation rules. How are changes to rules tested? What mechanisms exist for migrating the new rules into the production environment? Does the use of the runtime engine cause any performance concerns?
If you decide your data migration tool should have an open architecture, it must be able to generate code to support the data transformations. The criteria for evaluating code generators are not unique to the data migration process. If your organization invested in code generators for new application development, much of the criteria used in selecting those products can be used to evaluate the code-generation capabilities of data migration tools.
When evaluating a product's code generation capabilities, look for strengths in the following areas:
Programming languages supported.
Does it support your organization's preferred languages for the data warehouse environment? For example, if your company is using C++ for its client/server applications, you shouldn't consider a tool that only generates Cobol and runs in the IBM mainframe environment.
Support for user exits. Even though the types of data transformation rules the tool supports may be very robust, you can't guarantee that all your organization's transformation rules can be specified using the tool's constraint language. The tool should provide support for user exits that can be used to incorporate your organization's custom code into the program logic. The tool should provide multiple points in the sequence of the transformation logic where user exits can be included.
Compilation, test, and execution environment. The tool should let you choose where the code it generates is compiled, tested, and executed (whether at the desk-top, server, or mainframe). Likewise, you should be able to select different environments for each development life cycle phase. Ideally, the tool should be able to integrate with your standard development environment. For example, if you normally develop, compile, and test code in the client/server environment even though the programs are intended to execute on an IBM mainframe, the tool should be able to accommodate this type of development environment.
Program optimization. Is the code that the tool generates optimized for your target execution environment? The generated code must perform efficiently for the type of processing your organization performs (for example, matching multiple files, sorting, internal tabling of reference data, real-time database access).
Runstream support. A data migration tool must be able to generate the runtime instructions (for example, in IBM Job Control Language or Oracle Scripts) necessary to compile and generate code and execute the extract, transformation, and load processes appropriate for the target environment. Ideally, the tool will let you customize these run-stream commands to conform to your organization's standards.
Maintainability. Finally, you must assess the facilities to create and maintain the program code. How easy is it to change the logic that is generated into code? Can the tool recognize any customizations made to the generated code and incorporate that code when the program is regenerated? Does the tool provide adequate reporting on the code it generates and the errors it encounters? Can it report on changes made to a program and highlight the location of divergence from the original program?
METADATA MANAGEMENT
When selecting tools to support your data migration process, you should evaluate what types of metadata the tool requires and how well it shares this metadata with other tools. Most tools that support the data warehouse environment are metadata-based. For example, a data migration tool uses metadata to specify its transformation rules. An OLAP tool uses metadata when assisting end users in developing queries and interpreting the results of running queries. Furthermore, these tools often require access to the same metadata. Both data migration and OLAP tools require access to the metadata that describes the source data and the target database structures.
Criteria for evaluating a tool's metadata management environment focus on how well the tool integrates with other tools that create and use metadata. What facilities does it have for importing and exporting metadata? Is the format of its import/export structures documented and easy to understand? Does it support industry-standard metadata exchange formats (such as CASE Definition Interchange Format (CDIF) or the Metadata Coalition Interface Standard (MDI S))? Can it interface with other tools to share metadata (CASE, repositories, OLAP tools)? Does it share metadata through a direct connection, or does it use an import/export process in which the metadata is exported from the host tool in the first step and then imported by the user from within the target tool in the second step?
The tool should be able to generate reports that assist the data migration analyst in validating the transformation rules and data usage. We've provided the evaluation criteria for this type of reporting in the Data Transformation section of this column. The tool should organize its metadata to support "where used" analysis for both source and target data. It should also be able to list all the data a specific transformation rule uses, as well as view the transformation rules to which a specific data element is input or output.
Many data migration tools have evolved into fill' repository products. In this case, your data migration tool can serve as the hub of your metadata management environment. The evaluation criteria for a repository product are beyond the scope of this article. However, you may find the criteria presented in "So You Want an Information Repository?" (Database Programming&Design, May, 1989) useful in evaluating your data migration tool's repository capabilities.
Performing an evaluation of data transformation tools will ensure that you will be able to develop a clear and concise plan for your data transformation, with a full understanding of where your chosen tool can automate your process and where you need to focus your conversion resources. In future articles, we will provide our assessment of the more popular data transformation products against our evaluation criteria.
Terry Moriarty, president of Inastrol, a San Francisco-based information management consultancy, specializes in customer relationship information and metadata management. Her common business models have been used as the basis of customer models for companies within the financial services, telecommunication, software/hardware technology manufacturing, and retail consumer product industries. You can reach her at terry@inastrol.com.Suzi Hellwege is vice president of InfoSolutions, a San Francisco-based information management consultancy. You can reach her at Suzi.Hellwege@yahoo.com.