Like Hypertext Markup Language (HTML), Extensible Markup Language (XML) is derived from Standard Generalized Markup Language (SGML). XML is a simplified subset of SGML that has been designed specifically for use on the Web. XML is defined in a Recommendation published by the World Wide Web Consortium (W3C). The latest version of this document is available at http://www.w3.org/TR/REC-xml.
XML is more complete and disciplined than HTML, and it is also a framework for creating markup languages -- it allows you to define your own application-oriented markup tags.
XML provides a set of rules for structuring data. Like HTML, XML uses tags and attributes, but the tags are used to delimit pieces of data, allowing the application that receives the data to interpret the meaning of each tag. These properties make XML particularly suitable for data interchange across applications, platforms, enterprises, and the Web. The data can be structured in a hierarchy that includes nesting.
An XML document is made up of declarations, elements, comments, character references, and processing instructions, indicated in the document by explicit markup.
The simple XML document that follows contains an XML declaration followed by the start tag of the root element, <d_dept_list>, nested row and column elements, and finally the end tag of the root element. The root element is the starting point for the XML processor.
<?xml version="1.0"?>
<d_dept_list>
   <d_dept_list_row>
      <dept_id>100</dept_id>
      <dept_name>R &amp; D</dept_name>
      <dept_head_id>501</dept_head_id>
   </d_dept_list_row>
   ...
</d_dept_list>
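As a rough illustration of how an application might read such a document, the following sketch uses Python's standard xml.etree.ElementTree module. It is not how InfoMaker processes XML. It assumes the declaration is closed with ?> and the ampersand in the department name is written as &amp;, and the function name read_departments is made up for this example.

```python
import xml.etree.ElementTree as ET

# Sample document modeled on the example above. Assumption: the declaration
# is closed with "?>" and the ampersand is escaped as "&amp;".
DOC = """<?xml version="1.0"?>
<d_dept_list>
  <d_dept_list_row>
    <dept_id>100</dept_id>
    <dept_name>R &amp; D</dept_name>
    <dept_head_id>501</dept_head_id>
  </d_dept_list_row>
</d_dept_list>
"""


def read_departments(xml_text):
    """Return one dict per row element, keyed by column element name."""
    root = ET.fromstring(xml_text)  # root is the <d_dept_list> element
    return [{column.tag: column.text for column in row} for row in root]


departments = read_departments(DOC)
print(departments)
```

The parser hands back the character data with the &amp; reference already resolved, so the application sees the department name as plain text.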
This section contains a brief overview of XML rules and syntax. For a good introduction to XML, see XML in 10 points at http://www.w3.org/XML/1999/XML-in-10-points. For more detailed information, see the W3C XML page at http://www.w3.org/XML/, the XML Cover Pages at http://xml.coverpages.org/xml.html, or one of the many books about XML.
An XML document must be well-formed, and it can also be valid.
Valid documents
To define a set of tags for use in a particular application, XML uses a separate document named a document type definition (DTD). A DTD states what tags are allowed in an XML document and defines rules for how those tags can be used in relation to each other. It defines the elements that are allowed in the language, the attributes each element can have, and the type of information each element can hold. Documents can be verified against a DTD to ensure that they follow all the rules of the language. A document that satisfies a DTD is said to be valid.
If a document uses a DTD, the document type declaration that references or contains the DTD must follow the XML declaration and precede the root element.
XML Schema provides an alternative mechanism for describing and validating XML data. It provides a richer set of datatypes than a DTD, as well as support for namespaces, including the ability to use prefixes in instance documents and accept unknown elements and attributes from known or unknown namespaces. For more information, see the W3C XML Schema page at http://www.w3.org/XML/Schema.
Well-formed documents
The second way to check XML syntax is to rely only on the generic rules of XML and assume that the document uses its own tags properly. XML provides a set of generic syntax rules that must be satisfied, and as long as a document satisfies these rules, it is said to be well-formed. All valid documents must be well-formed.
Checking a document only for well-formedness is faster than validating it because the parser does not have to verify the document against the DTD or XML schema. When valid documents are transmitted, the DTD or XML schema must also be transmitted if the receiver does not already possess it. Well-formed documents can be sent without other information.
XML documents should conform to a DTD or XML schema if they are going to be used by more than one application. If they are not valid, there is no way to guarantee that various applications will be able to understand each other.
There are a few more restrictions on XML than on HTML; they make parsing of XML simpler.
Tags cannot be omitted
Unlike HTML, XML does not allow you to omit tags. This guarantees that parsers know where elements end.
The following example is acceptable HTML, but not XML:
<table>
   <tr>
      <td>Dog</td>
      <td>Cat
      <td>Mouse
</table>
To change this into well-formed XML, you need to add all the missing end tags:
<table>
   <tr>
      <td>Dog</td>
      <td>Cat</td>
      <td>Mouse</td>
   </tr>
</table>
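One way to confirm the difference between the two table examples is to hand each to an XML parser and see whether it is accepted. The sketch below uses Python's standard xml.etree.ElementTree module. The helper name is_well_formed is an assumption made for this illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if xml_text parses as well-formed XML, otherwise False."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Acceptable HTML, but end tags for <td> and <tr> are missing.
HTML_STYLE = "<table> <tr> <td>Dog</td> <td>Cat <td>Mouse </table>"

# The same content with every end tag supplied.
XML_STYLE = ("<table> <tr> <td>Dog</td> <td>Cat</td> "
             "<td>Mouse</td> </tr> </table>")

print(is_well_formed(HTML_STYLE))
print(is_well_formed(XML_STYLE))
```

The check only tests well-formedness. It does not validate the document against a DTD or XML schema.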
Representing empty elements
Empty elements cannot be represented in XML in the same way they are in HTML. An empty element is one that is not used to mark up data, so in HTML, there is no end tag. There are two ways to handle empty elements:
- Place an end tag immediately after the start tag. For example:
  <img href="picture.jpg"></img>
- Use a slash character at the end of the start tag:
  <img href="picture.jpg"/>
  This tells a parser that the element consists only of one tag.
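The two notations for an empty element are equivalent to a parser. The following sketch, which uses Python's standard xml.etree.ElementTree module, parses both forms and compares the results. The describe helper is a hypothetical name introduced only for this comparison.

```python
import xml.etree.ElementTree as ET


def describe(element):
    """Summarize an element as (tag, attributes, text, number of children)."""
    return (element.tag, dict(element.attrib), element.text, len(element))


# Start tag followed immediately by an end tag.
with_end_tag = ET.fromstring('<img href="picture.jpg"></img>')

# Single tag that ends with a slash.
self_closing = ET.fromstring('<img href="picture.jpg"/>')

print(describe(with_end_tag) == describe(self_closing))
```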
XML is case sensitive
XML is case sensitive, which allows it to be used with non-Latin alphabets. You must ensure that letter case matches in start and end tags: <MyTag> and </Mytag> are treated as tags of two different elements, so a parser does not accept them as a matching pair.
White space
White space in the content between a start tag and an end tag is not changed by XML parsers. It is passed to the application as part of the data.
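As a small illustration of white space handling, the sketch below parses an element whose content has leading and trailing white space, using Python's standard xml.etree.ElementTree module. The element content is invented for this example. Any trimming is left to the application.

```python
import xml.etree.ElementTree as ET

# Content with two leading spaces, a line break, and two trailing spaces.
element = ET.fromstring("<dept_name>  R and D\n  </dept_name>")

raw_text = element.text          # white space arrives as part of the data
trimmed_text = raw_text.strip()  # the application decides whether to trim

print(repr(raw_text))
print(repr(trimmed_text))
```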
All elements must be nested
All XML elements must be properly nested. All child elements must be closed before their parent elements close.
There are two major types of application programming interfaces (APIs) that can be used to parse XML:
- Tree-based APIs map the XML document to a tree structure. The major tree-based API is the Document Object Model (DOM) maintained by W3C. A DOM parser is particularly useful if you are working with a deeply-nested document that must be traversed multiple times.
  For more information about the DOM parser, see the W3C Document Object Model page at http://www.w3c.org/DOM.
- Event-based APIs use callbacks to report events, such as the start and end of elements, to the calling application, and the application handles those events. These APIs provide faster, lower-level access to the XML and are most efficient when extracting data from an XML document in a single traversal.
  For more information about the best-known event-driven parser, SAX (Simple API for XML), see the SAX page at http://sax.sourceforge.net/.
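The difference between the two API styles can be sketched with the DOM and SAX implementations in the Python standard library (xml.dom.minidom and xml.sax). This is only an illustration of the two approaches, not the Xerces-C++ API. The second row with dept_id 200 and the names dept_ids_with_dom, dept_ids_with_sax, and DeptIdHandler are assumptions made for this example.

```python
import xml.dom.minidom
import xml.sax

# Two rows; the second row (dept_id 200) is invented for illustration.
DOC = (
    "<d_dept_list>"
    "<d_dept_list_row><dept_id>100</dept_id></d_dept_list_row>"
    "<d_dept_list_row><dept_id>200</dept_id></d_dept_list_row>"
    "</d_dept_list>"
)


def dept_ids_with_dom(xml_text):
    """Tree-based: build the whole document in memory, then query it."""
    document = xml.dom.minidom.parseString(xml_text)
    return [node.firstChild.data
            for node in document.getElementsByTagName("dept_id")]


class DeptIdHandler(xml.sax.ContentHandler):
    """Event-based: collect dept_id text as the parser reports events."""

    def __init__(self):
        super().__init__()
        self.ids = []
        self._buffer = None  # holds text chunks while inside <dept_id>

    def startElement(self, name, attrs):
        if name == "dept_id":
            self._buffer = []

    def characters(self, content):
        # The parser may deliver the text of one element in several chunks.
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "dept_id":
            self.ids.append("".join(self._buffer))
            self._buffer = None


def dept_ids_with_sax(xml_text):
    """Run the SAX parser once over the document and return the ids."""
    handler = DeptIdHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.ids


print(dept_ids_with_dom(DOC))
print(dept_ids_with_sax(DOC))
```

Both functions return the same values. The DOM version keeps the whole tree available for further traversal, while the SAX version keeps only what the handler chooses to store.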
Xerces parser
InfoMaker includes software developed by the Apache Software Foundation (http://www.apache.org/). The XML services for reports are built on the Apache Xerces-C++ parser, which conforms to both DOM and SAX specifications. For more information about the parser, see the Xerces C++ Parser page at http://xerces.apache.org/xerces-c/index.html.