XML Architecture for Interchangeable Data

by Robert Aydelotte

Introduction
The introduction of XML[1] has been the catalyst for data sharing innovations in 2000. Activities around XML are exploding, and the number of XML specifications, applications, and services has become too large to track. XML will soon be the lingua franca of electronic data interchange, but the use of XML is still in a formative stage. Establishing a consistency of use across a community of users is a necessary step to gaining the best advantage from XML-based technologies.

This paper seeks to describe some of the problems that hinder this consistency of use, and to suggest directions for addressing some of these problems. These problems are defined and discussed mechanistically - as if inherent in data - but it should be recognized that the fundamental issue remains a communication problem (among and between both people and computers). The true test of data interchange is whether the meaning intended by the author is the meaning understood by the reader.
Data Patterns

Contextual Data
One view defines information as data in context. The term context refers to the items of data that 'surround' an individual item of data (a datum) and that provide the what, where, who, when and how of that datum. This context occurs naturally for all real data, but the context is not always or easily captured. The separation of data from context is generally compensated for by the addition of definitions, creating specific data types. An example of this in E&P is the specification of distinct columns for measured and true vertical depths in a relational table containing picks in a wellbore. Each column definition explains that a data instance is either a measured or a true vertical depth. Alternatively, a single column could have been used for depth, and the 'type' of depth could have been given via a relationship to the data creating activity.

Choosing to describe context using definitions or relationships has been an issue as long as information technology has existed. Definitions provide a more terse and 'fit for purpose' environment, while relationships allow greater flexibility and expressiveness. The limitations of definitional implementations prompted the development and adoption of the relational model several decades ago.

The root of this issue may be examined by recognizing that every datum is synthesized from a collection of data, meaning that the concepts of 'raw' and 'processed' data are artificial distinctions, albeit useful ones. Stated another way, data has a fractal texture, with context describing the specific texture. When using data, a process-specific viewpoint is adopted and, based upon that viewpoint, some parts of this texture are converted to definitional descriptors while other parts are extracted as related values. As different viewpoints are required by changing processes, different blends of definitional and relational styles are created, each containing specific sets of related, 'fit-for-purpose' data values.
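The measured versus true vertical depth example can be sketched in code. The following is a minimal illustration under stated assumptions, not a structure taken from any E&P data model: the class names, the field names (`md`, `tvd`, `depth_type`) and the sample values are all hypothetical.

```python
from dataclasses import dataclass

# Definitional style: the meaning of each value is fixed by its column name.
@dataclass
class PickDefinitional:
    pick_name: str
    md: float   # measured depth
    tvd: float  # true vertical depth

# Relational style: one generic depth column; the meaning of each value
# comes from a related record (here, a key into DEPTH_TYPES).
@dataclass
class DepthValue:
    pick_name: str
    depth: float
    depth_type: str

DEPTH_TYPES = {"MD": "measured depth", "TVD": "true vertical depth"}

def to_relational(pick):
    """Convert one definitional row into relational rows, one per depth."""
    return [
        DepthValue(pick.pick_name, pick.md, "MD"),
        DepthValue(pick.pick_name, pick.tvd, "TVD"),
    ]

def to_definitional(rows):
    """Rebuild the definitional row from the relational rows of one pick."""
    by_type = {row.depth_type: row.depth for row in rows}
    return PickDefinitional(rows[0].pick_name, by_type["MD"], by_type["TVD"])

pick = PickDefinitional("Top Sand A", 2500.0, 2350.0)
rows = to_relational(pick)
```

The relational form can accept a new kind of depth by adding an entry to `DEPTH_TYPES`, whereas the definitional form would need a new column; this is the flexibility versus terseness trade-off described above.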

What are Data Patterns?
A data pattern is a named set of related, 'fit-for-purpose' data values. The 'simplest' data pattern may be the contents of a row in an un-related table, where all the necessary data values are physically collected in a fixed definitional structure. This pattern consists of several data values but no relationships, as each data value is independently defined by a unique column definition and all real relationships between data values are implied by these definitions. The 'most complex' data pattern may be described as N individual data values tied together by up to N*(N-1) relationships, with each data value being stored in a separate table (or other structure). In this case, the meaning of each data value is given by instantiating appropriate relationships rather than by an imparted definition.

Almost all common data patterns are combinations of definitional and relational structures. For example, an X12 or EDIFACT message consists of multiple selected records (with strongly defined types), with each record containing appropriate segments of data and with relationships held as repeated data values (i.e., foreign keys). When widely used, these data patterns are the structures upon which data interchange rests.

One source of great difficulty in data interchange is the use of different combinations of definitional and relational structures. A second source of difficulty is the misuse of existing data structures, either by homonym (making an existing data structure mean something different) or by synonym (creating a different data structure with the same meaning).
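The homonym and synonym problems can be made concrete with a small sketch. It assumes, for illustration only, that each specification can be reduced to a dictionary mapping a structure name to a plain-text meaning and that equal meanings are spelled identically; real definitions are rarely that tidy. The function name and sample entries are hypothetical.

```python
def find_conflicts(spec_a, spec_b):
    """Compare two {structure name: meaning} dictionaries.

    Returns (homonyms, synonyms):
      homonyms - names present in both specs but with different meanings
      synonyms - (name_a, name_b) pairs of different names with the same meaning
    """
    homonyms = sorted(
        name
        for name in spec_a.keys() & spec_b.keys()
        if spec_a[name] != spec_b[name]
    )
    synonyms = sorted(
        (name_a, name_b)
        for name_a, meaning_a in spec_a.items()
        for name_b, meaning_b in spec_b.items()
        if name_a != name_b and meaning_a == meaning_b
    )
    return homonyms, synonyms

# Hypothetical sample specifications.
spec_a = {"depth": "measured depth", "well_name": "name of the well"}
spec_b = {
    "depth": "true vertical depth",
    "md": "measured depth",
    "well_name": "name of the well",
}

homonyms, synonyms = find_conflicts(spec_a, spec_b)
```

Here `depth` is a homonym (same name, different meaning), while `depth` in the first specification and `md` in the second are synonyms (different names, same meaning).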

Data Patterns - Large and Small
Data patterns are not limited to simple definitional and relational structures. One justification for using relational structures is the need to embed a repeating structure within a data pattern. This demonstrates that data patterns may be composed of data patterns, manifesting the fractal nature of data. It is possible in principle, though impractical, to completely decompose every data pattern into smaller parts - and these parts may themselves be decomposed, ad infinitum. To control the decomposition of data patterns, they are assigned labels relating to commonly used business concepts within communities of practice. These labels are assigned very strong definitional descriptions, but actually represent named, surrogate data patterns.

The aggregation of smaller patterns into larger data patterns involves transforming the less important relational structures into named, definitional structures (or dropping them). There is a loss of data inherent in this process, however, making the reverse transformation difficult or impossible. When interchanging data between systems using different scales of data patterns, it is relatively easy to transform data into larger patterns. If the additional required data is available, transforming data into smaller patterns is often not difficult either. It is very difficult, however, to complete the 'round trip' that re-creates an initial, detailed data pattern after transforming it to a 'summarized' data pattern.

In addition to the specification of data patterns themselves, it is necessary to capture the business rules that define how data patterns are transformed in scale (granularity) to other data patterns. These rules are often implied by the prose that describes the summarized patterns, but these descriptions should, where possible, be machine interpretable. It may also be necessary to retain the detailed data patterns for use after the creation of the summary data pattern, eliminating the tedious and error-prone work of data re-generation.
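The difficulty of the 'round trip' can be illustrated with a deliberately simple sketch. The summary fields chosen here (count, mean, minimum, maximum) and the sample readings are assumptions made for the example; they do not come from any particular specification.

```python
from statistics import mean

def summarize(samples):
    """Aggregate a detailed pattern (a list of readings) into a summary pattern."""
    return {
        "count": len(samples),
        "mean": mean(samples),
        "min": min(samples),
        "max": max(samples),
    }

# Two different detailed patterns (hypothetical readings).
detail_a = [10.0, 20.0, 20.0, 30.0]
detail_b = [10.0, 15.0, 25.0, 30.0]

summary_a = summarize(detail_a)
summary_b = summarize(detail_b)

# Retaining the detail alongside the summary avoids any need to re-generate it.
retained = {"summary": summary_a, "detail": detail_a}
```

Both detailed patterns produce the same summary, so no reverse transformation can tell which one it started from. Keeping the detailed pattern next to its summary, as in `retained`, is one way to preserve the information that the aggregation drops.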

Processes and Patterns
As suggested earlier, a business process is the genesis of each data pattern. The viewpoint generated by a selected business process suggests the combination of definitions and relationships, and of detailed and summary descriptions, that is appropriate for that data pattern. In general, low-level calculations are best suited to data patterns of detailed definitions; conversely, high-level interpretations are best suited to summarized, more relational data structures. This suggests that data integration activities are easier to perform with summarized data patterns, while number crunching is facilitated by more detailed data patterns.

Changes of viewpoint within a business process will predictably result in changes to the desirable data patterns, and massive process revisions will have corresponding data pattern changes. Therefore, it is normal and natural for data patterns to change constantly over time. It is also to be expected that data patterns will not be consistent within communities of users, as each user will repeatedly customize available processes to the specific technical problems being solved. Only change is unchanging.

Data Patterns and XML
XML has characteristics that make it extremely well suited to describe data patterns. Specifically, XML has a strong set of languages to describe data structures, and rapidly evolving capabilities to manage content within these structures. While defining, managing and facilitating data patterns may not seem to be the motivations behind XML, these capabilities are clearly visible in SGML, the grandfather of XML.

The use of XML is removing the traditional barriers between documents, applications and databases. XML is genetically capable of document description. Applications build data structures in memory but, for I/O, translate these to message structures (e.g., files) that have many characteristics of documents, making XML useful. Databases define multiple, massive and interconnected data structures that, in their entirety, may not be expressible in XML, but these may be described as a series of individual, interrelated data patterns. The consistent use of XML creates a significant opportunity for improving the semantic consistency of data as it resides in and moves between documents, applications and databases.

XML Files and Specifications
Each XML file has the ability to reference another file that defines its structure. These structural definition files define the XML tags and their hierarchy, including composition and the optionality and cardinality of each component. An XML file is "well-formed" when it obeys the syntax rules of XML itself; it is declared to be "valid" when it also conforms to the rules stated by the defining file. Currently there are many options being proposed as the "standard" language to write XML structural definitions, such as Document Type Definition (DTD) and XML Schema. These two differ mainly in aspects dealing with the representation of data within an XML file, but both provide a hierarchy of tags that define the structure of an XML file. Collectively, these and other language definitions are used to create the definitions that are referred to here as XML specifications.

Elements are Data Patterns
The element in XML is an excellent representation of a data pattern. XML specifications are composed of elements, a concept inherited from SGML. An element is the base definitional unit for all XML specifications. XML also contains attributes and entities, but these are both entirely dependent upon elements. Each XML specification contains one and only one 'root' element, implying that the entire file defines the pattern for that element. XML elements are assigned labels called tags, which are unique within an XML specification file. This means that two different XML specification files can define the same tag in different ways (homonyms) or the same concept with different tags (synonyms).

Each element may contain a combination of data values and other elements. A hierarchy is created by building elements that contain other elements. Also, elements may be re-used as long as a recursive structure is avoided. XML also allows elements from one specification to be referenced by other XML specifications, further building consistency. Small patterns (such as elements that contain only a data value) can be re-used by copying, as there is less risk of maintenance-induced inconsistencies. Larger data patterns (such as location on the surface of the Earth) can be re-used by reference to a designated XML specification.

Differences between XML specification languages mostly appear in the definition of constraints and the representation of individual data values within an element. Some common representations useful for bulk data (e.g., data tables, BLOBs, etc.) are not yet supported in XML, although these may currently be described value-by-value or in attached files (as in WellLogML). As the standards for XML specification languages proceed, it is not unreasonable to expect that these capabilities will be added.
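The idea that an element is a data pattern, and that a file can be checked against such a pattern, can be sketched with Python's standard `xml.etree.ElementTree` module. This is an illustration under stated assumptions: the tag names and sample values are hypothetical, and the exact-match comparison is a crude stand-in for a DTD or XML Schema, since it ignores optionality, cardinality and data types. Note that XML distinguishes a well-formed file (correct syntax) from a valid one (conforming to its structural definition), and that `ElementTree` itself only checks well-formedness.

```python
import xml.etree.ElementTree as ET

def pattern_of(elem):
    """Return an element's pattern as (tag, [patterns of its child elements])."""
    return (elem.tag, [pattern_of(child) for child in elem])

def check(text, expected_pattern):
    """Classify text as 'not well-formed', 'well-formed only' or 'valid'.

    'valid' here means only that the file's pattern equals expected_pattern.
    """
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        return "not well-formed"
    return "valid" if pattern_of(root) == expected_pattern else "well-formed only"

# Hypothetical specification sample: 'location' is a smaller pattern nested
# inside the larger 'well' pattern.
SPEC_SAMPLE = """
<well>
  <name>Example 1</name>
  <location>
    <latitude>29.76</latitude>
    <longitude>-95.37</longitude>
  </location>
</well>
"""

well_pattern = pattern_of(ET.fromstring(SPEC_SAMPLE))
location_pattern = pattern_of(ET.fromstring(SPEC_SAMPLE).find("location"))

complete = (
    "<well><name>B</name><location><latitude>1</latitude>"
    "<longitude>2</longitude></location></well>"
)
missing_location = "<well><name>B</name></well>"
broken = "<well><name>B</well>"
```

In this sketch `location_pattern` is itself a complete pattern that could be re-used by other specifications, which mirrors the re-use by reference described above.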

Standardization based on Elements
Though not perfect, data pattern standardization is best measured in units of XML elements. Each element represents the statement, in XML, of a (proposed or accepted) standard data pattern. Each provides a manageable semantic package that can be labeled and defined, and to which business rules or handling instructions can be attached. In a perfect environment, an XML element would only be defined once in a community of users, and its re-use in many XML specifications would ensure the semantic consistency of data within all their XML data files. As XML specifications are now being created and managed in distinct files, the widespread standardization of elements would require a change of emphasis among XML developers (yielding some autonomy), but this change will not be quick or easy. Such a shift does not need to be universal, as an individual community of users may create local conventions which do not compromise the basic standards or tools.
Proposed Architectural Components
This discussion has not focused on specific advantages of XML, but an XML-based solution is highly desirable technically and virtually required otherwise[2].
Requirements
Description of Components
The components required to meet most of these requirements are currently available within XML technology. This proposal concerns only the manner in which they are used. The concepts of multiple XML specification levels linked by XSL Transformations[3] have been proposed and tested previously[4,5].

Application XML Specifications
At the lowest level, XML specifications should be constructed that meet the specific data content and structure requirements of each application, application suite or database. All aspects of the XML specification (schema language used, tag style, hierarchy, use of attributes and elements, etc.) should be designed to support the applications/databases that use it. This specification should be the responsibility of the application developer. The primary use of this specification is to insulate applications from other, more general XML specifications.

Local XML Specifications
At a higher level, a second XML specification should be constructed to synthesize the data patterns of related application XML specifications into a single data structure. This specification should enable data interchange among the application XML specifications it is designed to support. First, this level of XML specification must address the integration issues and data conflicts for a local group of applications and databases. Secondly, it should incorporate data structures that support the data patterns established for the relevant subject material. Thirdly, the data patterns used in this specification should be compatible with the data patterns found in other local interchange XML specifications. The selection of other local interchange XML specifications to accommodate should be based upon the business processes used within the community of users being supported. Most aspects of these specifications (schema language used, tag style, hierarchy, use of attributes and elements, etc.) should be based upon guidelines and practices appropriate for the larger community of application developers it supports. This specification should be accessible to end-users engaged in data management activities. Also, this specification should be consistent with industry standard XML specifications to the greatest extent possible.

Industry XML Specifications
Industry XML specifications should synthesize the data patterns found in the local XML specifications into a common data structure. At this level, addressing data management problems should be much more important than application-related issues. The primary role of the industry XML specifications is to support data interchange among as wide a population of end-users and application developers as possible. All aspects of these specifications should follow widely adopted conventions and guidelines (currently evolving). Also, multiple and overlapping specifications should be anticipated at this level, and every effort should be made to harmonize the data structures between these specifications.

XSL Transformations
Given three levels of XML specifications, one very important role of XSL Transformations (XSLT) is to convert XML data files into files compatible with other levels of specification. Each conversion is composed of two separate XSLT specifications: A-to-B and B-to-A. It should be the responsibility of those owning the lower level XML specification to provide and maintain the XSLT specifications necessary to accomplish this. In the case of transformations going from higher to lower levels, specific knowledge of the requirements of the lower level is needed. In the opposite transformation, any additional information required to complete the conversion is only available at the lower level. Additional XSLT specifications may be constructed to convert all or parts of XML data files to different specifications at the same level.
These should be developed based upon commercial opportunities or specific needs. It should be recognized that these conversions will be incomplete when the data patterns addressed are inconsistent.

Schema Adjuncts
In addition to XSL Transformations, the use of an XML file by an application may require additional information specific to that application for the appropriate XML specification. A component called a schema adjunct has been proposed to provide this information[6]. These extra specifications provide application specific intelligence (including code) that equips the application to correctly handle XML data files that follow the XML specification for which the schema adjunct was created.

Public XML Repositories
A necessary component of an XML architecture for data interchange is near-immediate access to the other components. Repositories have already been created for this purpose. One function provided by repositories is to preserve and present XML specifications for public use, allowing the recipient of an XML data file to access the specification file that defines it. In addition, repositories may contain valuable XSLT specifications, sample files, processing guidelines and workflow guidelines.

The unit of specification in repositories is the XML specification file, not an element within the file. Elements within each file are independent of elements in other files (except when directly referenced). Additional meta-data may need to be captured to fully manage elements as data patterns, but this can be approximated with naming (and namespace) conventions. This may be counter to some repository use guidelines, however[7].
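The paired A-to-B and B-to-A conversions described under XSL Transformations can be sketched as follows. This is not XSLT: the Python standard library has no XSLT processor, so plain functions built on `xml.etree.ElementTree` stand in for the two stylesheets. The two file layouts (an attribute-based application-level `pick` and an element-based local-level `pick`) are hypothetical and chosen only to show a definitional structure being converted to a more relational one and back.

```python
import xml.etree.ElementTree as ET

def app_to_local(app_xml):
    """A-to-B: application-level file (attributes) to local-level file (elements)."""
    src = ET.fromstring(app_xml)
    out = ET.Element("pick")
    ET.SubElement(out, "name").text = src.get("name")
    for depth_type in ("md", "tvd"):
        # Each depth becomes a generic <depth> element qualified by its type.
        depth = ET.SubElement(out, "depth", {"type": depth_type.upper()})
        depth.text = src.get(depth_type)
    return ET.tostring(out, encoding="unicode")

def local_to_app(local_xml):
    """B-to-A: local-level file (elements) back to application-level file."""
    src = ET.fromstring(local_xml)
    attrs = {"name": src.findtext("name")}
    for depth in src.findall("depth"):
        attrs[depth.get("type").lower()] = depth.text
    return ET.tostring(ET.Element("pick", attrs), encoding="unicode")

# Hypothetical application-level data file.
app_file = '<pick name="Top Sand A" md="2500" tvd="2350" />'

local_file = app_to_local(app_file)
round_trip = local_to_app(local_file)
```

The round trip succeeds here only because both layouts carry the same data values. Where one level holds values the other lacks, the conversion would be incomplete, as noted above for inconsistent data patterns.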

Proposed Architecture
A complete architecture for interchangeable data using XML has not been developed or tested. Further work is needed to define more precisely the characteristics of each component. The following diagram illustrates how the components above may be used.

References

copyright© 2000 by POSC. All rights reserved. Originally published 7 June 2000; last updated 20 Feb 2001.