What is XML

XML is a text-based markup language that is fast becoming the standard for data interchange on the Web. As with HTML, you identify data using tags (identifiers enclosed in angle brackets, like this: <…>). Collectively, the tags are known as “markup”.

But unlike HTML, XML tags identify the data, rather than specifying how to display it. Where an HTML tag says something like “display this data in bold font” (<b>...</b>), an XML tag acts like a field name in your program. It puts a label on a piece of data that identifies it (for example: <message>...</message>).

Note:
Since identifying the data gives you some sense of what means (how to interpret it, what you should do with it), XML is sometimes described as a mechanism for specifying the semantics (meaning) of the data.

In the same way that you define the field names for a data structure, you are free to use any XML tags that make sense for a given application. Naturally, though, for multiple applications to use the same XML data, they have to agree on the tag names they intend to use.

Here is an example of some XML data you might use for a messaging application

<message>
 <to>you@yourAddress.com</to>
 <from>me@myAddress.com</from>
 <subject>XML Is Really Cool</subject>
 <text>
 How many ways is XML cool? Let me count the ways...
 </text>
</message>

Note: Throughout this tutorial, we use boldface text to highlight things we want to bring to your attention. XML does not require anything to be in bold!

The tags in this example identify the message as a whole, the destination and sender addresses, the subject, and the text of the message. As in HTML, the <to> tag has a matching end tag: </to>. The data between the tag and and its matching end tag defines an element of the XML data. Note, too, that the content of the <to> tag is entirely contained within the scope of the <message>..</message> tag. It is this ability for one tag to contain others that gives XML its ability to represent hierarchical data structures

Once again, as with HTML, whitespace is essentially irrelevant, so you can format the data for readability and yet still process it easily with a program. Unlike HTML, however, in XML you could easily search a data set for messages containing “cool” in the subject, because the XML tags identify the content of the data, rather than specifying its representation.

Tags and Attributes

Tags can also contain attributes — additional information included as part of the tag itself, within the tag’s angle brackets. The following example shows an email message structure that uses attributes for the “to”, “from”, and “subject” fields:

<message <strong>to="</strong>you@yourAddress.com<strong>" from=</strong>"me@myAddress.com<strong>"</strong> 
 <strong>subject="</strong>XML Is Really Cool<strong>"</strong>> 
 <text>
 How many ways is XML cool? Let me count the ways...
 </text>
</message>

As in HTML, the attribute name is followed by an equal sign and the attribute value, and multiple attributes are separated by spaces. Unlike HTML, however, in XML commas between attributes are not ignored — if present, they generate an error.

Since you could design a data structure like <message> equally well using either attributes or tags, it can take a considerable amount of thought to figure out which design is best for your purposes

Empty Tags

One really big difference between XML and HTML is that an XML document is always constrained to be well formed. There are several rules that determine when a document is well-formed, but one of the most important is that every tag has a closing tag. So, in XML, the </to> tag is not optional. The <to> element is never terminated by any tag other than </to>.

Note: Another important aspect of a well-formed document is that all tags are completely nested. So you can have <message>..<to>..</to>..</message>, but never <message>..<to>..</message>..</to>. A complete list of requirements is contained in the list of XML Frequently Asked Questions (FAQ) at http://www.ucc.ie/xml/#FAQ-VALIDWF. (This FAQ is on the w3c “Recommended Reading” list at http://www.w3.org/XML/.)

Sometimes, though, it makes sense to have a tag that stands by itself. For example, you might want to add a “flag” tag that marks message as important. A tag like that doesn’t enclose any content, so it’s known as an “empty tag”. You can create an empty tag by ending it with /> instead of >. For example, the following message contains such a tag:

<message to="you@yourAddress.com" from="me@myAddress.com" 
 subject="XML Is Really Cool">
 <strong><flag/></strong> 
 <text>
 How many ways is XML cool? Let me count the ways...
 </text>
</message>

Note: The empty tag saves you from having to code <flag></flag> in order to have a well-formed document. You can control which tags are allowed to be empty by creating a Document Type Definition,or DTD. We’ll talk about that in a few moments. If there is no DTD, then the document can contain any kinds of tags you want, as long as the document is well-formed.

Comments in XML Files

XML comments look just like HTML comments:

<message to="you@yourAddress.com" from="me@myAddress.com" 
 subject="XML Is Really Cool">
 <strong><!-- This is a comment --></strong>
 <text>
 How many ways is XML cool? Let me count the ways...
 </text>
</message>

The XML Prolog

To complete this journeyman’s introduction to XML, note that an XML file always starts with a prolog. The minimal prolog contains a declaration that identifies the document as an XML document, like this:

<?xml version="1.0"?>

The declaration may also contain additional information, like this:

<?xml version="1.0" encoding="ISO-8859-1" standalone="yes"?>

The XML declaration is essentially the same as the HTML header, <html>, except that it uses <?..?> and it may contain the following attributes:

version
Identifies the version of the XML markup language used in the data. This attribute is not optional.
encoding
Identifies the character set used to encode the data. “ISO-8859-1” is “Latin-1” the Western European and English language character set. (The default is compressed Unicode: UTF-8.)
standalone
Tells whether or not this document references an external entity or an external data type specification (see below). If there are no external references, then “yes” is appropriate

The prolog can also contain definitions of entities (items that are inserted when you reference them from within the document) and specifications that tell which tags are valid in the document, both declared in a Document Type Definition that can be defined directly within the prolog, as well as with pointers to external specification files. But those are the subject of later tutorials. For more information on these and many other aspects of XML, see the Recommended Reading list of the w3c XML page at http://www.w3.org/XML/.

Note: The declaration is actually optional. But it’s a good idea to include it whenever you create an XML file. The declaration should have the version number, at a minimum, and ideally the encoding as well. That standard simplifies things if the XML standard is extended in the future, and if the data ever needs to be localized for different geographical regions.

Everything that comes after the XML prolog constitutes the document’s content.

Processing Instructions

An XML file can also contain processing instructions that give commands or information to an application that is processing the XML data. Processing instructions have the following format:

  &lt;?<em>target</em> <em>instructions</em>?&gt;

where the target is the name of the application that is expected to do the processing, and instructions is a string of characters that embodies the information or commands for the application to process.

Since the instructions are application specific, an XML file could have multiple processing instructions that tell different applications to do similar things, though in different ways. The XML file for a slideshow, for example, could have processing instructions that let the speaker specify a technical or executive-level version of the presentation. If multiple presentation programs were used, the program might need multiple versions of the processing instructions (although it would be nicer if such applications recognized standard instructions).

Note: The target name “xml” (in any combination of upper or lowercase letters) is reserved for XML standards. In one sense, the declaration is a processing instruction that fits that standard. (However, when you’re working with the parser later, you’ll see that the method for handling processing instructions never sees the declaration.)