Archived at Pineapplesoft
 ananas.org 
  The Pineapplesoft Link newsletter covered a wide range of technical topics; see the archived issues.
The newsletter was first emailed in 1998. In 2001 Benoît discontinued it in favour of professional writing for magazines.
The “May 1998” page was archived in 2003 to preserve the original content of May 1998.
 


 

Welcome to the fifth issue of Pineapplesoft Link.

This month, I will return to XML with a discussion of well-formed versus valid XML documents. The distinction between the two is simple, and I will cover its implications for you as an XML user.

I'd like to hear from you. Your opinions will help me improve the newsletter, so please send your comments or suggestions to [address removed; the newsletter is no longer published, thank you for your support].

Also if you are a Java developer and you are based in Europe, I would like to hear from you: I am preparing an article on Java developments in Europe.

Pineapplesoft Link, May 98:
Well-formed and Valid XML Documents

Last month I promised you an article on XML. I envisioned an article on the multiple applications of XML. Indeed XML is a very flexible standard and it has been used by different groups to serve somewhat different objectives. I did write that article. But if you want to read it, you will have to visit developer.com (http://www.developer.com). I am told it will be published on Monday 4 May. It covers two different applications for XML:

  • document management;
  • standards.

It aims to clarify the relationship between standards like OSD or CDF and document management. This is usually an issue for newcomers to XML: people with a background in document management are confused by its use in standards and vice versa.

Anyway I urge you to visit developer.com on Monday 4 May. Look for "The Two Faces of XML" under Tech Focus. For now, back to this month's topic: well-formed and valid documents.

Well-formed and Valid XML Documents

As you probably know, XML and HTML both derive from SGML, an international ISO standard: HTML is an application of SGML and XML is a simplified subset of it. SGML introduced the concept of the Document Type Definition (DTD), a formal definition of the structure of a document. You can think of a DTD as a set of rules that describes what is allowed and what is not in an SGML document.

XML documents come in two flavors: well-formed and valid. Well-formed documents are the less stringent of the two: they simply require that all elements are cleanly nested. Valid documents, on the other hand, must be well-formed and, in addition, must declare a DTD and adhere to it. A variety of XML tools, known as validating parsers, check the conformance of documents against their DTDs.
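To make "cleanly nested" concrete, here is a minimal sketch in Python. It assumes the standard library's ElementTree module, whose parser is non-validating: it rejects documents that are not well-formed but never checks them against a DTD. The two sample memos and the function name are invented for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if `text` parses as XML, i.e. it is well-formed."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Elements are cleanly nested: <b> opens and closes inside <p>.
NESTED = "<memo><p>See you <b>Monday</b>.</p></memo>"

# Elements overlap: <b> is still open when </p> arrives.
OVERLAPPING = "<memo><p>See you <b>Monday</p></b>.</memo>"

print(is_well_formed(NESTED))       # True
print(is_well_formed(OVERLAPPING))  # False
```

Neither memo carries a DTD, so neither could be called valid; the first one is merely well-formed.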

Clearly, well-formed XML documents are similar to HTML documents. Indeed, HTML documents never include a DTD. There is an HTML DTD (published as part of the HTML standard) but, as an HTML user or author, you will never see it. The HTML DTD is supposed to be universally available and is therefore not included in documents, if only to reduce download times.

Valid documents, on the other hand, are akin to full-blown SGML documents. They carry the bulk of the DTD with them and this makes it possible to validate them.

When You Want a DTD

DTDs are essential for large document management projects. One thorny issue when managing large volumes of documents (up to several thousand pages) is enforcing coherence. Large bodies of documentation are typically written by several authors over a long period of time. Nevertheless, it is important for readers that they maintain enough coherence in style and tone. Unfortunately, when left on their own, authors tend to adopt different styles. The resulting documents are more a patchwork of individual pieces than coherent wholes.

For example, one writer may use many levels (part, chapter, section, sub-section, paragraph) where another will use them more sparingly. One writer may include abstracts and other navigational "hints" (icons, sidebar, etc.) where another one won't. This is not good for readers who have to adapt as they progress through the documentation.

To fight this problem, organizations use a manual of style or guidelines. The guidelines describe how documents should be organized, which navigational "hints" should be used, etc.

DTDs are the next logical step. A DTD is nothing more than a formalized guideline, i.e. it is a guideline in a format that is usable by a computer. There are several advantages to using formal guidelines. For one, the computer can validate documents automatically and reject those documents that are non-compliant. More importantly the computer can guide authors through the guidelines, e.g. by prompting for an abstract where one is required.
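As a rough illustration of a guideline that a computer can enforce, the sketch below checks one house rule by hand. It is not a DTD validator: the rule (every chapter needs an abstract), the element names and the sample manual are all hypothetical, and Python's standard library is used only because it ships with a non-validating XML parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical house rule: every <chapter> must contain an <abstract>.
# A DTD could state a similar rule declaratively, for example:
#   <!ELEMENT chapter (title, abstract, section+)>
REQUIRED_IN_CHAPTER = "abstract"


def chapters_missing_abstract(text: str) -> list:
    """Return the titles of chapters that break the house rule."""
    root = ET.fromstring(text)
    offenders = []
    for chapter in root.iter("chapter"):
        if chapter.find(REQUIRED_IN_CHAPTER) is None:
            offenders.append(chapter.findtext("title", default="(untitled)"))
    return offenders


MANUAL = """
<manual>
  <chapter><title>Installing</title><abstract>Setup steps.</abstract></chapter>
  <chapter><title>Troubleshooting</title></chapter>
</manual>
"""

print(chapters_missing_abstract(MANUAL))  # ['Troubleshooting']
```

An editor could use the same information the other way round and prompt the author for the missing abstract instead of rejecting the document.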

Many large documentation projects use SGML for this. With SGML-enabled editors or adapted versions of standard word processors, they effectively enforce a common style.

Valid XML documents support essentially the same features, with all the other benefits of XML, most importantly cheaper tools (SGML tools are notoriously expensive) and available expertise (there are very few SGML experts and in-house expertise is rare).

When You Don't Want a DTD

As we have seen, the DTD is essential for the production of large bodies of documentation, but it is a major waste of time for smaller documents. Writing a DTD is not an easy task, just as writing corporate guidelines is not easy.

For example, DTDs are overkill for letters, memos, faxes, etc. In most cases, enforcing strict guidelines for these documents is not essential. You wouldn't want a DTD where you have no corporate guidelines or where they are limited to the use of stationery. Note that there are specific environments where a DTD would make sense for letters, e.g. when there is a legal obligation to record correspondence.

It is also important to realise that validating documents is an editorial activity, i.e. one that makes sense at the publisher's site but not at the user's. Therefore validating documents after they have been delivered is fruitless. Once documents have been accepted for publication, they should be correct. If they are not, it is not the user's responsibility or fault.

In other words, there is no need for a DTD at the reader site. This is particularly relevant on the Internet where download times are an issue.

Well-formed XML documents exist specifically for those documents where developing a DTD would be overkill or for Internet delivery of documents. In that sense, XML has many of the benefits of HTML but allows one to define one's own tags, which is more flexible.

To DTD Or Not To DTD

There are compelling reasons to use DTDs and there are equally good reasons not to use them. Ultimately, it depends on the context: authors and editors may want DTDs, readers don't. Fortunately, XML supports both modes and lets you choose.

Best of all, the choice is not all or nothing. It is possible to switch between valid and well-formed for a single document. It would make perfect sense to create documents, validate them against a DTD and publish them as well-formed, i.e. without the DTD.
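The publishing half of that workflow can be sketched in a few lines of Python. The validation step itself is not shown, because the parser in Python's standard library is non-validating; the sketch assumes the document has already passed a validating parser. The memo, its DTD and the function name are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Authoring copy: the DTD travels inside the document (internal subset).
AUTHORING_COPY = """<?xml version="1.0"?>
<!DOCTYPE memo [
  <!ELEMENT memo (to, body)>
  <!ELEMENT to (#PCDATA)>
  <!ELEMENT body (#PCDATA)>
]>
<memo><to>Readers</to><body>Issue five is out.</body></memo>"""


def publish_without_dtd(text: str) -> str:
    """Re-serialize the document; ElementTree keeps the elements only."""
    root = ET.fromstring(text)
    return ET.tostring(root, encoding="unicode")


published = publish_without_dtd(AUTHORING_COPY)
print(published)
```

The published copy no longer carries a DOCTYPE declaration, yet it is still well-formed and therefore readable by any XML parser at the reader's site.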

This is another illustration of the flexibility of XML: the standard supports the best practices in document management and has been built for the real world where flexibility is essential.

Self-promotion department

After a very public first quarter, April was a quieter month. Except for a short "Introduction to CORBA" speech (inspired by last month's Pineapplesoft Link article) in London, I have not released any new documents. I did write two articles but they are not scheduled for release until later in May.

This apparent quiet is deceptive, though: at Pineapplesoft we have focused on behind-the-scenes work. For example, we were again very active in electronic commerce with several consulting projects and through our active involvement in The XML/EDI Group.

Talking of XML/EDI, there will be a two-hour conference on XML/EDI in Oslo, with Martin Bryan and myself, on May 6. For further details, write to Gard Titlestad (gard.titlestad@kith.no).

About Pineapplesoft Link

Pineapplesoft Link is published freely, every month via email. The focus is on distributed and mobile computing, Java, XML, and other Internet technologies. The articles target people interested in or concerned about technology, either personally or professionally.

This issue of Pineapplesoft Link may be distributed freely via email, newsgroup or CompuServe forums provided it is not modified and this message and copyright notice is included. Distribution via other media (including the Web) is restricted. If you want to re-post Pineapplesoft Link regularly, please let me know.

Editor: Benoît Marchal
Publisher: Pineapplesoft sprl (www.psol.be)

Acknowledgements: thanks to Sean McLoughlin MBA for helping me with this issue.

Back issues are available at http://www.psol.be/old/1/newsletter/.

Although the editor and the publisher have used reasonable endeavors to ensure accuracy of the contents, they assume no responsibility for any error or omission that may appear in the document.

Last update: May 1998.
© 1998, Benoît Marchal. All rights reserved.