So You Want a Repository
5/89, revised 8/98
Thomas A. Bruce, T.A.B.S.E.T, Jonah Fuller, DataLedger, Inc., Terry Moriarty, Inastrol
COPYRIGHT (C) TABSET / DATALEDGER, INC. / INASTROL, 1989-1998
All Rights Reserved. No part of this publication may be reproduced in any way without prior written permission.
INTRODUCTION
Have you noticed how the way we construct applications is changing? Automation is finally hitting our own profession. Computer Aided Software Engineering (CASE) tools are being evaluated, piloted or used in nearly every major development organization. Gone are the days of hand-drawn (and often hand-lettered) requirements and design models. Now they are developed and maintained on PCs that also provide crisp laser-printed documentation. Screen dialogue prototyping provides interactive requirements acquisition and application design capabilities that interface with mainframe code generators.
Most definitely, the cobbler's kids are finally wearing the latest in designer footwear. And the shoestring that keeps these tennies tied together and functional is a central, integrated database of information required to develop and manage software constructed in the CASE environment.
At the same time, enterprises in government and industry are recognizing that information is an important business asset that must be managed and developed as any other business resource. In many, the need to manage the data that describes the business information is being emphasized. There is a great deal of overlap in the data that an enterprise needs to manage its information asset and the data that information engineers require to develop applications in a CASE environment.
Suddenly, Data Administration (DA) finds that it is responsible for one of the most important databases in the enterprise. The lowly Data Dictionary has evolved from a data element documentation tool into the Information Repository, the knowledgebase that assists information specialists in their efforts to document and manage the enterprise's information asset and the portfolio of applications that has been configured to support that asset. This same repository serves as the heart of the Application Development Environment (ADE), which is the foundation of the enterprise's suite of CASE tools.
It has fallen to the Data Administrators to develop the requirements for managing the enterprise's information resource within a CASE environment, decipher the evolving Repository standards, evaluate the available products and select the Information Repository product that best meets their organization's needs. We hope this article will help our fellow DAs in that analysis by providing some criteria that can be used in evaluating Information Repository products.
As in any situation where the environment is evolving and in flux, the vocabulary used to describe and discuss the information repository tends to be ill defined and/or defined inconsistently. Being first and foremost good DAs, we find this situation somewhat intolerable. Therefore, for the purpose of this article we will use the following definition of terms:
Information Repository (IR)
a knowledgebase which integrates information about the enterprise's Business Data and its Application Portfolio.
Information Resource Dictionary System (IRDS)
an application which provides controlled access to the Information Repository. According to the original ANSI Information Resource Dictionary System standard, an IRDS is "...a key computer software tool for the management of data and information resources. It provides facilities for recording, storing and processing descriptions of an organization's significant data and data processing resources."
Repository
the combination of the IR and the IRDS
An enterprise's Business Information Resource (BIR) consists of two parts:
Many enterprises have information systems that are currently operating, are in development or are planned to be developed to manage the information asset. This set of information systems is commonly known as the enterprise's Application Portfolio (AP).
The Business Information Resource and Application Portfolio are described in terms of components that we will refer to as Data Objects. These data objects, the relationships among them and the facts that must be known about them are an important part of the Information Asset. The information about these data objects must be maintained, secured and distributed to the appropriate parties in ways that are similar to the maintenance, security, and distribution of business information.
An Information Repository serves as the source of information to be used by:
The people, automated tools and applications that access the information repository are referred to as "Users".
Purpose
The purpose of this article is to offer criteria to be used as the basis for evaluating a Repository. The series of questions provided below is intended to help determine the scope and quality of the functions and data structures supported by the system being considered, and the degree to which it adheres to evolving industry standards (e.g. the OMG Meta Object Facility).
Scope
In preparing this article, we have assumed that the reader is familiar with basic information repository concepts, and we have limited the criteria to repository functions only. We recognize that many vendors provide or will provide additional tools which are designed to use the information repository, such as data structure generators, graphic model generators, code and data structure decomposition and analysis support, and so on.
We have not included criteria for evaluating these types of tools, although their availability will surely be a serious consideration when choosing a repository. Although it is an important component of the Application Development Environment, the repository is only one such component.
This article is not intended to be a set of guidelines that will determine with certainty whether a given repository can perform to your satisfaction. We have only attempted to address those criteria necessary to determine if a product warrants further in-house evaluation within the constraints of your current ADE.
Each evaluation criterion is presented as a question or checklist. These criteria can be easily tailored to fit the evaluator's objectives. Not all of the questions found in the document may be needed at any one time.
REPOSITORY EVALUATION CRITERIA
There are three modes in which a Repository can be used within an enterprise:
Passive
The information repository is used primarily as a documentation tool that is accessible by people. It is not used by automated tools or applications.
Active in Development
The information repository is used as a documentation tool for people. It is also used during the development of an application to automate some of the software engineering steps. As such, it is used by automated development tools, often referred to as Computer Aided Software Engineering (CASE) tools, to generate data structures, code, documentation, test cases, databases, models, etc.
Active in Production
In this situation the information repository is used as more than a documentation and development tool. It is accessed by applications as the mechanism to gain access to the enterprise's databases. The data integrity, validation and security access rules are enforced through the IR. If changes must be made to any of these rules, they are made in the IR. Often, the applications that use the changed information require no changes themselves.
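As a rough illustration of the "Active in Production" mode, the sketch below shows an application that asks a repository for a validation rule instead of hard-coding it. It is only a sketch under stated assumptions: `RuleRepository`, `accept_order` and the `credit_limit` fact are hypothetical names invented for the example, and a real IR product would expose its own interface.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class RuleRepository:
    """Hypothetical stand-in for an IR that is active in production."""
    # Maps a fact name (e.g. "credit_limit") to its current validation rule.
    rules: Dict[str, Callable[[object], bool]] = field(default_factory=dict)

    def set_rule(self, fact: str, rule: Callable[[object], bool]) -> None:
        self.rules[fact] = rule

    def validate(self, fact: str, value: object) -> bool:
        # Facts with no registered rule are accepted in this sketch.
        rule = self.rules.get(fact)
        return True if rule is None else rule(value)


def accept_order(repo: RuleRepository, credit_limit: int) -> bool:
    """Application code: it asks the repository instead of hard-coding the rule."""
    return repo.validate("credit_limit", credit_limit)


repo = RuleRepository()
repo.set_rule("credit_limit", lambda v: 0 <= v <= 10_000)
before = accept_order(repo, 25_000)   # rejected under the original rule

# The rule changes in the repository only; accept_order is untouched.
repo.set_rule("credit_limit", lambda v: 0 <= v <= 50_000)
after = accept_order(repo, 25_000)    # accepted under the revised rule
```

The point of the design is that the rule lives in one place: the second `set_rule` call changes the behavior of every application that validates through the repository.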
1. Is the Repository passive?
2. Is the Repository active in development?
If so, to what degree?
3. Is the Repository active in production?
If so, to what degree?
A number of IRDS standards have been proposed by various standards groups:
In addition, Microsoft is developing its own repository product. Given the influence that Microsoft products tend to have on the marketplace, its repository architecture can also be considered to be a de facto standard.
The discussion of what these standards entail, how they differ and the current status of the proposals is a topic best saved for another article. However, it is important that the evaluator of an IRDS product be aware of these standards and, if possible, come to a conclusion as to which standard best fits the needs of his/her enterprise.
1. Which IRDS standards does the IR conform to:
2. Does the IR offer extensions to its base Standard? If so, will using these extensions preclude your ability to interface with tools and/or other IR products that require strict adherence to the Standard?
The Data Object Model is the basis of the structures that are supported in the IR. Normally, the Data Object Model is discussed in terms of an Entity-Relationship Diagram (ERD). This is the conceptual view of the actual structures supported by the IR database. To adequately evaluate the Data Object Model support features of an IR, it is important to develop your own model of the data objects your enterprise has identified as needed to support its information resource.
A Data Object Model includes:
The IR's Data Object Model is specified in terms of:
The Data Object Model corresponds, in concept, to what is commonly referred to as a Meta-Model.
A Data Object Instance is a single information resource or application portfolio component defined to the IR as a Data Object. For example, if the enterprise must have information about a Customer to conduct business, then Customer is a component of the information resource. Customer is defined to the IR as an Entity, which is a data object, and therefore becomes a data object instance in the IR.
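The distinction between a Data Object and a Data Object Instance can be sketched in a few lines. This is a simplified illustration, not a description of any product's meta-model: the class names and the facts `definition` and `steward` are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass(frozen=True)
class DataObject:
    """A meta-model type, e.g. Entity or Program (illustrative)."""
    name: str
    fact_names: Tuple[str, ...]   # facts that may be recorded about instances


@dataclass
class DataObjectInstance:
    """One information resource or application portfolio component."""
    data_object: DataObject
    name: str
    facts: Dict[str, str] = field(default_factory=dict)

    def set_fact(self, fact_name: str, value: str) -> None:
        # Only facts declared by the Data Object Model are accepted.
        if fact_name not in self.data_object.fact_names:
            raise KeyError(f"{fact_name!r} is not a fact of {self.data_object.name}")
        self.facts[fact_name] = value


# Customer is defined to the IR as an Entity, so it becomes an instance.
entity = DataObject("Entity", ("definition", "steward"))
customer = DataObjectInstance(entity, "Customer")
customer.set_fact("definition", "A party that buys goods or services.")
```

Here `entity` belongs to the Data Object Model (the meta-model), while `customer` is the instance that the model describes.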
If so, are there any constraints on the degree to which you can modify the supplied Data Object Model?
If so, are there any constraints on the degree to which you can delete data objects, relationships and/or facts from the supplied Data Object Model?
Data integrity concerns the accuracy and validity of data stored in the IR. The following criteria can be used to ensure that the IR provides an adequate level of data validity and integrity checking. It is important to understand that the allowed values of "facts", allowed relationships among facts, etc., which are often referred to as "domain" considerations, are an important part of the Data Object Model, and should be controllable like any other part of that model.
a. Numeric (Integer or Real)
b. Alphanumeric
c. Alphabetic
d. Text
e. Date
If so, is there an upper limit on the number of decimal places allowed?
a. Domain Integrity:
1) Ranges
2) Data Type
3) Uniqueness
4) Inter-fact dependencies
b. Entity Integrity:
1) on Insertion
2) on Replacement
3) on Deletion
c. Referential Integrity (IRD Rule Support):
1) Set Null
2) Restrict
3) Cascade
If so, are all of the features of that DBMS available to you through the IR and IRDS?
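The three referential integrity rules named in the checklist above (Set Null, Restrict, Cascade) differ only in what happens to dependents when a parent is deleted. The following sketch shows one way to express them, assuming a deliberately simple layout in which each child records the id of its parent. The function name `delete_parent` and the sample ids are hypothetical.

```python
from enum import Enum
from typing import Dict, Optional


class DeleteRule(Enum):
    SET_NULL = "set null"
    RESTRICT = "restrict"
    CASCADE = "cascade"


def delete_parent(
    parents: Dict[str, dict],
    children: Dict[str, Optional[str]],
    parent_id: str,
    rule: DeleteRule,
) -> None:
    """Delete a parent instance, applying `rule` to the children that reference it.

    `children` maps each child id to the id of its parent (or None).
    """
    dependents = [c for c, p in children.items() if p == parent_id]
    if rule is DeleteRule.RESTRICT and dependents:
        raise ValueError(f"{parent_id} still has dependents: {dependents}")
    for child_id in dependents:
        if rule is DeleteRule.CASCADE:
            del children[child_id]          # remove the dependent too
        else:  # SET_NULL
            children[child_id] = None       # keep the dependent, drop the link
    del parents[parent_id]


parents = {"CUSTOMER": {}, "ORDER": {}}
children = {"CUST-NM": "CUSTOMER", "CUST-ID": "CUSTOMER", "ORD-DT": "ORDER"}
delete_parent(parents, children, "CUSTOMER", DeleteRule.SET_NULL)
```

When evaluating a product, it is worth asking whether the rule can be chosen per relationship, as it is per call here, or only once for the whole IR.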
An important objective of information resource management is to eliminate redundancy and to ensure that data object instances are named properly. Most enterprises have naming standards for how data object instances are named and the abbreviations to be applied to those names. The ability to enforce the enterprise's naming standards is an important consideration in choosing an information repository.
1. Does the IR provide the ability to generate a unique name for a new data object instance?
2. If a system generated name is provided, can the algorithm be tailored by the enterprise?
3. If a system generated name is not provided, how does the IR ensure that each data object instance has one unique name?
4. Does the IR support alias (alternate/enterprise-defined) names?
5. Does the IR provide the ability to identify for what purposes a given alias can be used (e.g. Customer Name can be used as a Business Name, Information Model Name, Report Heading, CUST-NM can be used as a COBOL Name, DB2 Name, etc.)?
6. Does the IR allow the same alias to be used for multiple data object instances?
7. Can the IR enforce uniqueness rules for data object instance aliases across data object instances?
8. Does the IR allow multiple aliases for the same purpose for the same data object instance?
9. Can the IR enforce uniqueness rules such that:
a. a given data object instance can only have one alias used for a given purpose
b. when used for a specific purpose, a given name can be used as an alias for only one data object instance
10. When presented with a new data object instance and its name, does the IR attempt to determine if the data object instance already exists? Does it attempt to determine if the name already exists?
11. Does the IR maintain standard abbreviation lists?
12. Does the IR maintain name construction rules?
13. Can the abbreviation lists and name construction rules be tailored by alias purpose (e.g. the rules and abbreviations used to construct COBOL names may be different from those used to construct Business Names or C names)?
14. Can the abbreviation lists and name construction rules be tailored by application?
15. Does the IR maintain a mapping between abbreviation lists so that names can be converted from one standard to another?
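Two of the naming questions above lend themselves to a short sketch: the alias uniqueness rules of question 9, and name construction from abbreviation lists tailored by alias purpose (questions 11 to 13). The code below is a minimal illustration under assumptions. `AliasRegistry`, `build_name`, the abbreviation entries and the separators are all made up for the example, and only the purposes "COBOL Name" and "Business Name" come from the questions themselves.

```python
from typing import Dict, Tuple


class AliasRegistry:
    """Hypothetical registry enforcing the two uniqueness rules in question 9."""

    def __init__(self) -> None:
        self._by_instance: Dict[Tuple[str, str], str] = {}  # (instance, purpose) -> alias
        self._by_alias: Dict[Tuple[str, str], str] = {}     # (alias, purpose) -> instance

    def add(self, instance: str, purpose: str, alias: str) -> None:
        if (instance, purpose) in self._by_instance:
            raise ValueError(f"{instance} already has a {purpose} alias")   # rule 9a
        if (alias, purpose) in self._by_alias:
            raise ValueError(f"{alias} is already a {purpose} alias")       # rule 9b
        self._by_instance[(instance, purpose)] = alias
        self._by_alias[(alias, purpose)] = instance


# Abbreviation lists keyed by alias purpose; the entries are made up.
ABBREVIATIONS: Dict[str, Dict[str, str]] = {
    "COBOL Name": {"CUSTOMER": "CUST", "NAME": "NM"},
}
SEPARATORS: Dict[str, str] = {"COBOL Name": "-", "Business Name": " "}


def build_name(business_name: str, purpose: str) -> str:
    """Apply the purpose's abbreviation list and separator to each word."""
    abbreviations = ABBREVIATIONS.get(purpose, {})
    words = [abbreviations.get(w.upper(), w) for w in business_name.split()]
    return SEPARATORS[purpose].join(words)


registry = AliasRegistry()
cobol_name = build_name("Customer Name", "COBOL Name")
registry.add("Customer Name", "COBOL Name", cobol_name)
```

With these made-up lists, `build_name("Customer Name", "COBOL Name")` yields `CUST-NM`, and a second attempt to register a COBOL Name for the same instance, or the same COBOL Name for a different instance, is rejected.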
Just as it is important to ensure that the contents of the data base are valid and accurate, it is important to ensure that the data base is not physically destroyed. Criteria for making that determination are noted in this section:
This section concerns data security (sometimes referred to as data privacy): the confidentiality and protection of information within the data base. The criteria recognize the need for more than passwords to control access to IR data and functions.
a. Function
b. Data Object
c. Data Object Instance
d. Data Object Fact
e. Data Object Fact Value or Range of Values
f. Data Object Relationships (provide the ability to view a data object instance, without the need to navigate through the relationships in which the data object instance participates)
2. Does the IR protect data from modification by unauthorized individuals?
3. Does the IR distinguish between read, update and delete access authority (i.e., can a user be authorized to read a record or an item but not to update or modify it)?
4. Does the IR provide a means of logging and analyzing unauthorized access attempts?
5. How does the IR inform the user that the data and/or function access attempted is restricted?
6. Does the IR provide a means for documenting who has made what changes to the data base, both at the data object definition and instance levels?
7. Does the IR provide a facility for encrypting sensitive information?
8. Are there downside effects of the levels of security provided in terms of performance overhead?
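Several of the security criteria above (access granted at the data object or data object instance level, separate read and update authority, and logging of unauthorized attempts) can be pictured with a small sketch. It is an assumption-laden illustration only: `AccessController`, the scope notation `Object/Instance` and the user names are hypothetical, and a real IR would cover more levels (function, fact, fact value, relationship).

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple

# A grant is (user, action, scope); scope names a data object or "Object/Instance".
Grant = Tuple[str, str, str]


@dataclass
class AccessController:
    grants: Set[Grant] = field(default_factory=set)
    denied_log: List[Grant] = field(default_factory=list)

    def allow(self, user: str, action: str, scope: str) -> None:
        self.grants.add((user, action, scope))

    def check(self, user: str, action: str, data_object: str, instance: str) -> bool:
        """Return True if the user holds the action on the object or the instance."""
        scopes = (data_object, f"{data_object}/{instance}")
        if any((user, action, s) in self.grants for s in scopes):
            return True
        # Record the failed attempt so it can be analyzed later.
        self.denied_log.append((user, action, f"{data_object}/{instance}"))
        return False


acl = AccessController()
acl.allow("analyst", "read", "Entity")              # every Entity instance
acl.allow("steward", "update", "Entity/Customer")   # one instance only
```

In this sketch the analyst can read any Entity but update none, while the steward can update Customer only. Every refused request lands in `denied_log`, which is the raw material for the analysis asked about in question 4.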
The ability to place data object instances under change control so that the impact of modifications to existing definitions can be closely monitored is another important feature of an IR.
An Application is always complete, but never finished. The need to develop evolutionary applications which are architected for flexibility and change is obvious.
Maintenance is the process of implementing changes and enhancements throughout the life of an application. Version control is a set of procedures and techniques that allow multiple incarnations of the same data object instance to exist simultaneously as the business evolves over time.
There are two levels to version control: Data Object Instance and Release.
A Data Object Instance is an identifiable constituent of a business model or application. Examples include Entity, Business Function, Program, Database, Table, Data Element and Test Case. A Data Object Instance is documented as a Repository Object.
A Release is an identifiable packaging of related Data Object Instances that has been or is intended to be implemented as a unit. The changes to business models and applications can be managed through releases. Multiple releases of the same model/application may be in development and in production at the same time.
Version Control is concerned with the ability to modify data in the IR database without impacting the integrity of the relationships with the users of that data. Version control occurs at two levels:
Data Object Model version control deals with the ability to evolve the definition and structure of the enterprise's Data Object Model while providing support for data object instances defined under prior incarnations of the model. It is the option of each IR user to decide if and when conversion to a new Data Object Model version occurs.
Data Object Instance version control deals with the ability to evolve the definition of and relationships between the enterprise's business information resource and application portfolio data object instances while providing support for those applications that were developed using the prior definitions and relationships. Each application must have the option of deciding if and when to convert to the new version of the data object instance.
1. Does the IR have the ability to support multiple versions of the Data Object Model?
2. Can the Data Object Model change without impacting the existing automated tools and applications that were developed under the prior model?
3. Does the IR provide mechanisms to assist in the conversion of automated tools and applications from one version of the Data Object Model to another one?
4. Does the IR provide a system generated Id to uniquely identify a data object version?
If so, can the algorithm for generating this Id be tailored by the enterprise?
If a system generated Id is not provided, how is data object version uniqueness ensured?
5. Does the IR have the ability to support multiple versions of the same data object instance?
6. Is versioning enforced at the Fact level or at the data object instance level?
7. Can the data object instance change without impacting the applications that were developed using the prior definition?
8. Does the IR provide mechanisms to assist in the conversion of applications from one version of the data object instance to another one?
9. Does the IR provide a system generated Id to uniquely identify a data object instance version?
If so, can the algorithm for generating this Id be tailored by the enterprise?
If a system generated Id is not provided, how is data object instance version uniqueness ensured?
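Questions 5 and 9 ask whether several versions of the same data object instance can coexist and whether each receives a system generated Id. A minimal sketch of that behavior follows. The class `VersionedInstance`, the Id format `<name>.v<n>` and the `length` fact are assumptions for illustration, and versioning here is at the instance level rather than the Fact level.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class VersionedInstance:
    """Keeps every version of one data object instance (illustrative)."""
    name: str
    versions: List[Dict[str, str]] = field(default_factory=list)

    def add_version(self, facts: Dict[str, str]) -> str:
        """Store a new version and return a system-generated version Id."""
        self.versions.append(dict(facts))
        return f"{self.name}.v{len(self.versions)}"

    def get(self, version_id: str) -> Dict[str, str]:
        # The Id format "<name>.v<n>" is an assumption of this sketch.
        number = int(version_id.rsplit(".v", 1)[1])
        return self.versions[number - 1]


customer_name = VersionedInstance("CustomerName")
v1 = customer_name.add_version({"length": "30"})
v2 = customer_name.add_version({"length": "60"})
```

An application built against `v1` can keep reading the 30-character definition while newer applications use `v2`, which is the coexistence that question 7 is probing.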
Release Management is used to identify a package of data object instances that are intended to be treated as a unit. Release version control is nothing new to application developers. Techniques have been implemented in most environments which allow a release to be based upon prior releases, but which can evolve independently of the prior releases. Repository release management is an extension to these application concepts.
Releases build upon one another, where each release represents the delta from its parent release, starting with the current approved business model or production version of an application. Each release represents a snapshot of the data object instances at a specific point in time. A set of dependent releases is referred to as a "release stack".
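The release stack described above can be sketched as a chain of deltas that is resolved from the base release upward. This is one possible reading, offered as an assumption rather than any product's actual mechanism: the `Release` class, the use of `None` to mark a deletion and the sample release names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Release:
    """A release stores only its delta from the parent release."""
    name: str
    parent: Optional["Release"] = None
    changes: Dict[str, Optional[str]] = field(default_factory=dict)  # None marks a deletion

    def snapshot(self) -> Dict[str, str]:
        """Resolve the release stack from the base up to this release."""
        view = self.parent.snapshot() if self.parent else {}
        for instance, definition in self.changes.items():
            if definition is None:
                view.pop(instance, None)
            else:
                view[instance] = definition
        return view


production = Release("production", changes={"Customer": "v1", "Order": "v1"})
release_2 = Release("release-2", production, {"Customer": "v2"})
release_3 = Release("release-3", release_2, {"Order": None, "Invoice": "v1"})
```

Calling `release_3.snapshot()` walks the stack from `production` through `release_2` and applies each delta in turn, so the production release itself is never modified by the releases built upon it.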
One of the most important features of an IR is its ability to provide data object instance information for analysis purposes. A full complement of relational and text-oriented selection capabilities must be provided. In addition, complex comparative abilities are needed to identify possible data redundancies or deficiencies.
A complete GUI management system that serves as the on-line front-end to the IR is essential. Such a facility should be easy enough for a novice to use, providing visually satisfying forms that are workflow oriented.
To be most effective at managing and disseminating meta-data across the enterprise, the IRDS needs to be web-enabled. Deployment of business models and associated documentation over the enterprise's Intranet can be a valuable approach for reviewing and approving meta-data.
One of the requisites in a workstation environment is the need to support multiple concurrent users. This requirement implies a networking configuration capability for data sharing in the DBMS. Therefore, the IR must have those mechanisms in place to handle concurrent processing.
In addition, these users may be accessing the IR from different computer and terminal types. The need to support multiple platforms is very real.
If so, does the IR maintain the knowledge of what has been checked out to whom?
7. Does the IR allow the same data object instance to be checked out to multiple users at one time?
8. Does the IR allow updates to data object instances that have been checked out to other users?
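The check-out questions above turn on what the IR remembers about who holds what. The sketch below assumes one particular policy, exclusive check-out, in which an instance can be held by only one user at a time. That policy is an assumption chosen for illustration (a product may legitimately answer question 7 with "yes"), and `CheckoutRegistry` and the user names are hypothetical.

```python
from typing import Dict, Optional


class CheckoutRegistry:
    """Tracks which user holds which data object instance (exclusive check-out)."""

    def __init__(self) -> None:
        self._holder: Dict[str, str] = {}   # instance name -> user

    def check_out(self, instance: str, user: str) -> bool:
        """Grant the check-out unless another user already holds the instance."""
        if self._holder.get(instance, user) != user:
            return False
        self._holder[instance] = user
        return True

    def check_in(self, instance: str, user: str) -> None:
        if self._holder.get(instance) != user:
            raise PermissionError(f"{instance} is not checked out to {user}")
        del self._holder[instance]

    def holder(self, instance: str) -> Optional[str]:
        return self._holder.get(instance)


registry = CheckoutRegistry()
first = registry.check_out("Customer", "alice")    # granted
second = registry.check_out("Customer", "bob")     # refused: alice holds it
```

The `holder` lookup is the "knowledge of what has been checked out to whom" that the criteria ask about. A shared check-out policy would replace the single holder with a set of users.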
Performance is concerned with the speed with which the IR responds to user requests placed against it, and the efficiency with which data can be stored.
The Information Repository is central to the Application Development Environment, but is not its only component. Other tools, e.g. systems analysis and decomposition tools, modeling tools, compilers, source code library managers (if not part of the IR) and source code scanners/parsers must also be available.
When we evaluate any vendor's IR, we must evaluate the vendor's support capabilities too. There are no "bug free" products on the market! So we must be assured that the vendor will be there tomorrow as it is today.