 | Level: Introductory Joo Lima (joaolima@acm.org), Software Engineer
01 Mar 2002 This article proposes a simple, efficient, and flexible way to map XML data-centric documents to DB2 without using the XML Extender.
Abstract
The XML (eXtensible Markup Language) meta-language, despite its short lifetime, is already considered a key technology to the e-business world. Each day the amount of information that is exchanged in this new format and the demand for integration with the existing data increases. To cope with this demand, IBM has created the XML Extender for DB2®. Basically, the XML Extender provides two methods to integrate XML data to DB2: a column method and a collection method.
This article presents a different approach to the one used by the DB2 XML Extender. The presented solution is characterized by its simplicity, efficiency and flexibility. It is simple, because it uses only three tables to map XML data. It is efficient, as the indexes qualify for the choice of excellent access paths. And it is flexible, because through SQL statements it is possible to apply many operations on the XML data. The emphasis applied to the expression "XML data", in the title, is to emphasize that the proposed solution applies to data-centric XML documents only (and not document-centric).
 |
Introduction
The capacity of relational database management systems (RDBMS) to process great volumes of data is unquestionable. Conventional applications store "structured data" in tables. Indexes are defined to optimize queries and, optionally, to guarantee uniqueness. The optimization of SQL queries utilize sophisticated features, such as: query rewrite, multiple join methods, detailed statistics, parallelism etc. The evolution of RDBMS is due to the fact that they are based on a model formally defined by E. F. Codd [Codd1969]: the relational model. This theoretical model allows, for example, that relational algebra operations to be replaced by more efficient equivalent operations whereas not interfere with the result. The RDBMS optimizer accomplishes this substitution in an automatic way.
One of the main features of the XML standard is to allow data, usually considered 'not structured' or 'semi-structured', to be manipulated in a consistent and yet simpler form than the SGML (Standard Generalized Markup Language). Basically an XML document can be considered "well-formed" or "valid". A "valid" XML either contains a DTD (Document Type Definition) or references one, and the document conforms to the DTD. An "well-formed" XML document follows all the rules of the XML standard, but it either does not have a DTD or it fails to conform to it.
Mapping XML data to the relational model is not a trivial task (maybe one of the reasons for the variety of mapping solutions in the market).
According to Ronald Bourret[Bourret2001]:
"XML documents fall into two broad categories: data-centric and document-centric. Data-centric documents are those where XML is used as a data transport. They include sales orders, patient records, and scientific data. Their physical structure -- the order of sibling elements, whether data is stored in attributes or PCDATA-only elements, whether entities are used -- is often unimportant. (
) Document-centric documents are those in which XML is used for its SGML-like capabilities, such as in user's manuals, static Web pages, and marketing brochures. They are characterized by irregular structure and mixed content and their physical structure is important."
Our focus in this article deals with data-centric documents only.
Basic concepts of XML
The structure of an XML document is, by definition, a tree structure. There can be only one root node and all others nodes have to be organized in a hierarchical form. In Listing 1, we introduce an example of an XML document. The root node is the bibliography element. In the first line it's declared that it's an XML document and the following line indicates which DTD file will be used to validate the document. Figure 1 is a graphical representation of the XML document presented on Listing 1. The DTD defines elements order and their respective attributes (see Listing 2). For example: the book element is a composite of the authors or author elements and must be obligatorily followed by the elements title, publisher and optionally by the element isbn. It is important to notice that this DTD does not contain elements with MIXED or ANY content model. This is one of the features of XML data-centric documents that make easier the mapping and processing of the data.
Listing 1. XML data-centric sample document
<?xml version="1.0"?>
<!DOCTYPE bibliography SYSTEM "biblio.dtd">
<bibliography>
<entry id="Date2000">
<book year="2000" edition="2nd">
<authors>
<author>C. J. Date</author>
<author>Hugh Darwen</author>
</authors>
<title>Foundation for Future Database Systems -
The Third Manifesto</title>
<publisher>Addison Wesley</publisher>
<isbn>0-201-70928-7</isbn>
</book>
</entry>
<entry id="Codd1970">
<article year="1970">
<author>E. F. Codd</author>
<title>A Relational Model of Data for Large Shared Data Banks</title>
<journal>Communications of ACM</journal>
<volume>13</volume>
<pages>377--387</pages>
<issn>0001-0782</issn>
</article>
</entry>
</bibliography>
|
Listing 2. DTD of an XML data-centric sample document
<!ELEMENT bibliography ( entry )* >
<!ELEMENT entry ( article | book ) >
<!ATTLIST entry
id ID #REQUIRED >
<!ELEMENT article (( author | authors ), title, journal, volume?, pages?, issn? ) >
<!ATTLIST article
year CDATA #REQUIRED >
<!ELEMENT book (( author | authors ), title, publisher, isbn? ) >
<!ATTLIST book
year CDATA #REQUIRED
edition CDATA #IMPLIED >
<!ELEMENT authors ( author )* >
<!ELEMENT author ( #PCDATA ) >
<!ELEMENT title ( #PCDATA ) >
<!ELEMENT journal ( #PCDATA ) >
<!ELEMENT year ( #PCDATA ) >
<!ELEMENT volume ( #PCDATA ) >
<!ELEMENT pages ( #PCDATA ) >
<!ELEMENT issn ( #PCDATA ) >
<!ELEMENT publisher ( #PCDATA ) >
<!ELEMENT isbn ( #PCDATA ) >
|
Figure 1. Graphical presentation of an XML data-centric sample document
Presentation versus representation
This article addresses the following issue: how to represent the XML data, which has an inherently hierarchical structure, in a relational database where all data is presented in a tabular form. It is important to emphasize the distinction between presentation and representation. Chris Date [Date2000] is very emphatic in this issue:
"I am so tired of hearing claims to the effect that "the relational model is two-dimensional," or that relations are "flat" (and so on and so forth), while real data is "multidimensional." Anyone who thinks such claims constitute a valid criticism of the relational model clearly doesn't understand that model, and in fact is seriously confused. Of course it's true that a relation looks flat when pictured in tabular form on paper. But a picture of a thing isn't the thing! If a relation has n columns, then each row in that relation represents a point in n-dimensional space -- and the relation as a whole represents a set of such points. In other words, a relation of n columns is n-dimensional, not two-dimensional!"
Following the same thought, we intend to argue that it is possible to "represent" the hierarchical structure of the XML using the relational model. The data model that will be presented represents the tree structure using one of the approaches suggested by Joe Celko [Celko1995].
Proposed data model: Structure
The proposed data model is simple and is composed by only three relations:
-
Document
- One tuple for each XML document instance.
-
Element
- One tuple for each element of a document.
-
Attribute
- One tuple for each attribute of an element.
Figure 2 shows the Entity-Relationship diagram of the proposed solution.
Figure 2. Entity-Relationship diagram of the proposed solution
The Document table has the following attributes:
-
SEQ_XML
- Sequential number that identifies the XML document instance (primary key).
-
TEXT_XML
- Original content of XML document.
-
HEAD_XML
- Contains the declaration of an XML header that has information on encoding, version, etc.
-
DTD_REF_XML
- Contains a reference to the DTD file.
-
WELL_FORMED
- Indicates if an XML document is well-formed or not.
-
VALID_XML
- Indicates if an XML document is valid or not.
The Element table has the following attributes:
-
SEQ_XML
- Sequential number that identifies the XML document instance (primary key).
-
ORD_INITIAL
- Number generated by a tree-walking algorithm that identifies the element order in the tree before the visit to its subordinated elements.
-
TAG_NAME
- Element name.
-
ORD_FINAL
- Number generated by a tree-walking algorithm that identifies the element's order in the tree after the visit to its subordinated elements.
-
TAG_CONTENT
- Element's content, in case that it exists.
-
QTD_ATTRIBUTE
- Quantity of attributes that the element has.
The Attribute table has the following attributes:
-
SEQ_XML
- Sequential number that identifies the XML document instance (foreign key).
-
ORD_INITIAL
- Number generated by a tree-walking algorithm that identifies the element order in the tree before the visit to its subordinated elements (primary key).
-
ATT_SEQ
- Sequential number that identifies the attribute's sequence (primary key).
-
ATT_NAME
- Attribute's name.
-
TAG_CONTENT
- Element's content, in case that it exists.
-
ATT_VALUE
- Attribute's value.
The tables Element and Attribute store the elements and atributes content in VARCHAR columns. Depending on the application, this content may represent a number, a date, a string etc. The tables of the proposed model shouldn't be seen as the final destination of the data, but as a "data stage". After the parsing, validation and storage of the fragmented document, the data items will be retrieved according to each application and should be sent to its final destination where it shall receive conventional treatment.
To make sure that the mapping from the contents of the generic VARCHAR column to the destination column will not lead to inconsistencies, you can use in the parsing/validation process the resources available in the XML Schema standard. For example, suppose that an invalid date (30-feb-2001) is stored in an XML document. The DTD-only validation wouldn't detect the error (the invalid date would be stored in the VARCHAR column and the error would only be detected at the final mapping to a DATE type column). With XML Schema, this would be detected at parsing/validation time.
Figure 3 shows the graphical representation of an XML document with each element's ordered numbering. The picture below shows that each element possesses a pair of numbers: the number on the left is the ORD_INITIAL and the number on the right is the ORD_FINAL. In case the DOM (Document Object Model) is used, after parsing and validation (optional) of the XML document, a tree-walking algorithm can be applied. It is interesting to notice that the mathematical expression ((Element.ORD_FINAL - Element.ORD_INITIAL - 1) / 2) indicates the number of subordinated elements to the element.
Figure 3. Graphical presentation of an XML data-centric sample document with ordered numbering
The instance of the schema considered is shown in Figure 4 using the XML data sample document.
Figure 4. Table presentation of an XML data-centric sample document
Proposed data model: Operations
This modeling proposal allows many types of operations to be accomplished. In this article we will see three operations' groups:
Rebuilding the Original XML
To rebuild the original XML it is necessary to create a SELECT statement making use of the UNION ALL operator.
Listing 3 presents the SELECT statement that is made up by 4 subselect, as seen below:
- The first returns the main content: initial tag of the element, its attributes and content;
- The second returns the final tag of the element;
- The third returns the headline of the XML document.
- And the last returns the DTD line (if it exists).
Finally, the ORDER BY deals with ordering of the results according the original hierarchy by placing initial and final tags in their appropriate position.
Listing 3. Rebuilding the XML document with "seq_xml = 0"
SELECT CASE WHEN COALESCE(att_seq, 1) = 1
THEN '<' || RTRIM(e.tag_name)
ELSE '' END
|| CASE WHEN COALESCE(att_seq, 0) > 0
THEN ' ' || RTRIM(att_name) || '="' || RTRIM(att_value) || '"'
ELSE '' END
|| CASE WHEN qtd_attribute = COALESCE(att_seq, 0)
THEN '>'
ELSE '' END
|| COALESCE(RTRIM(tag_content), '') AS line,
e.ord_initial AS ord,
att_seq AS ord_att
FROM relxml.element e
LEFT OUTER JOIN
relxml.attribute a
ON e.seq_xml = a.seq_xml
AND e.ord_initial = a.ord_initial
WHERE e.seq_xml = 0
UNION ALL
SELECT '</' || RTRIM(tag_name) || '>' AS line,
ord_final AS ord,
0 AS ord_att
FROM relxml.element
WHERE seq_xml = 0
UNION ALL
SELECT RTRIM(head_xml) AS line, -1 AS ord, 0 AS ord_att
FROM relxml.document
WHERE seq_xml = 0
UNION ALL
SELECT RTRIM(dtd_ref_xml) AS line, 0 AS ord, 0 AS ord_att
FROM relxml.document
WHERE seq_xml = 0
ORDER BY ord, ord_att
|
It is also possible to return a determined subtree, in which case it will be necessary to indicate its root element. Listing 4 presents the SELECT statement that is required to return the desired subtree from the Book element (ord_initial = 3, ord_final = 16). Notice that the line that is responsible for the subtree filter is in bold.
Listing 4. Rebuilding a subtree from the original XML document.
SELECT CASE WHEN COALESCE(att_seq, 1) = 1
THEN '<' || RTRIM(e.tag_name
ELSE '' END
|| CASE WHEN COALESCE(att_seq, 0) > 0
THEN ' ' || RTRIM(att_name) || '="' || RTRIM(att_value) || '"'
ELSE '' END
|| CASE WHEN qtd_attribute = COALESCE(att_seq, 0)
THEN '>'
ELSE '' END
|| COALESCE(RTRIM(tag_content), '') AS line,
e.ord_initial AS ord
att_seq AS ord_att
FROM relxml.element e
LEFT OUTER JOIN
relxml.attribute a
ON e.seq_xml = a.seq_xml
AND e.ord_initial = a.ord_initial
WHERE e.seq_xml = 0
AND e.ord_initial >= 3 and e.ord_final <= 16
UNION ALL
SELECT '</' || rtrim(tag_name) || '>' AS line, ord_final AS ord, 0 AS ord_att
FROM relxml.element
WHERE seq_xml = 0
AND ord_initial >= 3 and ord_final <= 16
ORDER BY ord, ord_att
|
Given an element, it is possible to navigate to another element according to some features of the relationship between them. In this section we explore some possibilities of navigation using the proposed solution. We will verify the following cases of navigation:
The table below has two columns: the left column shows the operation, its parameters and the SELECT statement that implements the operation; the right column shows the figure that illustrates the navigation. In this figure, the arrow links the current node to those retrieved in each operation.
Table 1. Navigation between elements
a) retrieve root node
Parameter:
1 - Document number (seq_xml)
SELECT tag_name
FROM relxml.element
WHERE seq_xml = 0
AND ord_initial = 1
Result:
tag_name
------------
bibliography
| |
b) retrieve first child node
Parameters:
1 - Document number (seq_xml)
2 - Initial number (ord_initial)
SELECT tag_name, tag_content
FROM relxml.element
WHERE seq_xml = 0
AND ord_initial = 4 + 1
Result:
tag_name tag_content
-------- -----------
author C. J. Date
| |
c) retrieve next sibling node
Parameters:
1 - Document number (seq_xml)
2 - Initial number (ord_initial)
SELECT e2.tag_name, e2.tag_content
FROM relxml.element e,
relxml.element e2
WHERE e.seq_xml = 0
AND e.ord_initial = 5
AND e.seq_xml = e2.seq_xml
AND e.ord_final + 1 = e2.ord_initial
Result:
tag_name tag_content
-------- -----------
author Hugh Darwen
| |
d) retrieve previous sibling node
Parameters:
1 - Document number (seq_xml)
2 - Initial number (ord_initial)
SELECT e2.tag_name, e2.tag_content
FROM relxml.element e,
relxml.element e2
WHERE e.seq_xml = 0
AND e.ord_initial = 7
AND e.seq_xml = e2.seq_xml
AND e2.ord_final + 1 = e.ord_initial
Result:
tag_name tag_content
-------- -----------
author C. J.Date
| |
e) retrieve context;
Parameters:
1 - Document number (seq_xml)
2 - Initial number (ord_initial)
3 - Final number (ord_final)
SELECT e.tag_name, ord_initial
FROM relxml.element e
WHERE e.seq_xml = 0
AND e.ord_initial <= 7
AND e.ord_final >= 8
ORDER BY ord_initial DESC
Result:
tag_name ord_initial
------------ -----------
authors 4
book 3
entry 2
bibliography 1
| |
f) retrieve parent node
Parameters:
1 - Document number (seq_xml)
2 - Initial number (ord_initial)
3 - Final number (ord_final)
SELECT e.tag_name, e.tag_content
FROM relxml.element e
WHERE e.seq_xml = 0
AND e.ord_initial < 7
AND e.ord_final > 8
AND e.ord_initial =
(SELECT MAX(ord_initial)
FROM relxml.element e2
WHERE e2.seq_xml = e.seq_xml
AND e2.ord_initial < 7
AND e2.ord_final > 8)
Result:
tag_name tag_content
------------ -----------
authors
| |
g) retrieve last sibling node
Parameters:
1 - Document number (seq_xml)
2 - Initial number (ord_initial)
3 - Final number (ord_final)
SELECT e3.tag_name, e3.tag_content
FROM relxml.element e,
relxml.element e3
WHERE e.seq_xml = 0
AND e.seq_xml = e3.seq_xml
AND e.ord_final - 1 = e3.ord_final
AND 21 <> e3.ord_final
AND e.ord_initial < 20
AND e.ord_final > 21
AND e.ord_initial =
(SELECT MAX(ord_initial)
FROM relxml.element e2
WHERE e2.seq_xml = e.seq_xml
AND e2.ord_initial < 20
AND e2.ord_final > 21)
Result:
tag_name tag_content
------------ -----------
issn 0001-0782
| |
h) retrieve first sibling node
Parameters:
1 - Document number (seq_xml)
2 - Initial number (ord_initial)
3 - Final number (ord_final)
SELECT e3.tag_name, e3.tag_content
FROM relxml.element e,
relxml.element e3
WHERE e.seq_xml = 0
AND e.seq_xml = e3.seq_xml
AND e.ord_initial + 1 = e3.ord_initial
AND 30 <> e3.ord_initial
AND e.ord_initial < 30
AND e.ord_final > 31
AND e.ord_initial =
(SELECT MAX(ord_initial)
FROM relxml.element e2
WHERE e2.seq_xml = e.seq_xml
AND e2.ord_initial < 30
AND e2.ord_final > 31)
Result:
tag_name tag_content
------------ -----------
author E. F. Codd
| |
Notice that the operations presented in the table above always qualify an XML document instance, that is, the seq_xml attribute is always provided. The optimizer will choose the Index Scan Matching (Matchcols = 2) access path in the primary index of Element table, which provides an excellent response time.
Query operations (elements and attributes)
The proposed model allows that a query of a certain element or attribute will be done through a simple SELECT. Four types of queries are shown below:
The table below presents in the left column the type of operation, its parameters and corresponding SELECT statement. In the right column the result of the query is presented.
Table 2. Query operations on elements and attributes
a) retrieve elements based on an attribute's value.
Parameters:
1 - Document number (seq_xml)
2 - Attribute name (att_name)
3 - Attribute value (att_value)
SELECT e.tag_name,
e.ord_initial, e.ord_final
FROM relxml.element e,
relxml.attribute a
WHERE e.seq_xml = a.seq_xml
AND e.ord_initial = a.ord_initial
AND a.att_name = 'year'
AND a.att_value = '2000'
AND e.seq_xml = 0
|
Result:
Tag_name ord_initial ord_final
-------- ----------- ---------
book 3 16
|
b) retrieve elements based on an element's name.
Parameters:
1 - Document number (seq_xml)
2 - Element name (tag_name)
SELECT tag_content, ord_initial, ord_final
FROM relxml.element
WHERE seq_xml = 0
AND tag_name = 'author'
|
Result:
Tag_content ord_initial ord_final
----------- ----------- ---------
J. C. Date 5 6
Hugh Darwen 7 8
E. F. Codd 20 21
|
c) retrieve elements based on an element's content.
Parameters:
1 - Document number (seq_xml)
2 - Element name (tag_name)
3 - Element content (tag_content)
SELECT ord_initial, ord_final
FROM relxml.element
WHERE
tag_name = 'author'
and tag_content = 'E. F. Codd'
and seq_xml = 0
|
Result:
ord_initial ord_final
----------- ---------
20 21
|
d) retrieve leaf nodes.
Parameters:
1 - Document number (seq_xml)
SELECT ord_initial, ord_final,
tag_name, tag_content
FROM relxml.element
WHERE
seq_xml = 0
and ord_initial + 1 = ord_final
|
Result:
o_i o_f tag_name tag_content
--- --- -------- ------------
5 6 author C. J. Date
7 8 author Hugh Darwen
10 11 title Foundation for
Future ...
12 13 publisher Addisson
Wesley
14 15 isbn 0-201-70928-7
20 21 author E. F. Codd
22 23 title A Relational
Model of ...
24 25 journal Communications of ACM
26 27 volume 13
28 29 pages 377--387
30 31 issn 0001-0782
|
In the examples presented above, the instance of the XML document (seq_xml attribute) was always provided. This means that the query is executed in elements/attributes of a single document. However, it is possible to not inform the instance, and the query will be applied to all stored documents -- the scope of the query no longer is a tree, but a forest. In order to provide these queries a shorter response time, it's necessary to create some indexes:
- To optimize the queries "a" and "b" it is necessary to create a composite index on the columns (
att_name, att_value, seq_xml) in the Attribute table.
- To optimize the query "c" it is necessary to create a composite index on the columns (
tag_name, seq_xml) in the Element table.
Limitations
Some limitations of the proposed model are shown below:
- The proposed solution does not apply to the MIXED and ANY content models. These content models are frequently used in XML documents of the document-centric category. The proposed solution also applies to document-centric XML as long as the DTD doesn't allow for MIXED or ANY content.
- Updating the structure of the XML document (inclusion or exclusion of nodes) will require renumbering of many elements. This restriction can make impracticable this solution for cases where the documents have many elements and much update activity.
- In order to simplify the model, some features of the XML had, deliberately, not been contemplated, such as: comments, CDATA etc.
- We limited the element name to 100 bytes and its contents to 3000 bytes. These limits may be modified. For attributes, we have limited their name and value to 255 bytes each, to allow their participation in the creation of a composite index. Because the name and content of the attribute is indexable, it must favor, in the definition of the DTD, the modeling of a data item as an attribute (instead of an element), in case this data item is a frequent parameter of queries.
The limitation to the MIXED and ANY content models does not exist in a DOM based generic model, proposed by Richard Edwards [Edwards2000] and Sian Hope, for the mapping of XML documents to relational databases. Their schema uses fifteen tables and is more complete and complex than the one I am proposing. The persistent DOM solution allows for any type of XML document to be stored in a relational database.
Conclusion
The XML data mapping proposal for DB2 was developed and tested with success in a DB2 production environment. The proposal gathers the best of the two technologies: the flexibility of XML and the performance of DB2. An e-business solution has as an important issue: scalability. Therefore, it's important to bring together the performance of the RDBMS with the world's e-business current language: XML.
Resources
-
[Celko1995] Joe Celko, Unmatched node: cit , Morgan Kaufmann, 1995.
-
[Codd1969] E. F. Codd, "Derivability, Redundancy, and Consistency of Relations Stored in Large Data Banks", in IBM Research Report RJ599, 1969, San Jose, CA.
-
[Date2000] C. J. Date, "What do you mean, 'Post-Relational'?", in http://www.pgro.uk7.net/cjd1a.htm, June, 2000.
-
[Bourret2001] Ronald Bourret, "XML Database Products", in http://www.rpbourret.com/xml/XMLDatabaseProds.htm, August, 2001.
-
[Edwards2000] Richard Edwards and Sian Hope, "Persistent DOM: An architecture for XML repositories in relational databases", in Lecture Notes in Computer Science (vol. 1983), Springer-Verlag, pp416-421, 2000.
About the author  | 
|  |
Joo Alberto de Oliveira Lima began his career in software industry as a programmer/analyst in 1991. Since 1993 he has been involved with DB2 as developer, DBA and SQL performance analyst. The author has a BSc in Computer Science and a MSc in Software Engineering. Recently he has acquired the following certifications: IBM Certified Solutions Expert in "DB2 UDB V7.1 Database Administration for OS/390®", "DB2 UDB V7.1 Database Administration for UNIX®, Windows® and OS/2®" and "DB2 UDB V7.1 Family Application Development". He lives in Brasilia (Brazil) and can be reached at joaolima@acm.org.
Acknowledgments:
The author would like to gratefully acknowledge the comments and suggestions of Joo Batista de Holanda Neto, Omar Falco Soares, Ronald Bourret, and Berthold Reinwald.
|
Rate this page
|  |