The Smallest DocBook Big Picture

So you’re a writer, looking to make loads of money off your next best-seller. You fire up Microsoft Word to spend a few weeks drafting and crafting, carefully formatting your examples, properly aligning your figures and images, “bolding” this and italicizing that. Well into your hundredth page, your publisher contacts you regarding that draft you sent.

“This won’t work,” Bobby says, annoyingly. “Didn’t we agree on a different line spacing? Oh, and your tables should be a little narrower, and this, and that, and…” You sigh, knowing that you will be spending the next few days on coffee and cheap nicotine.

What happened there? Was it because you used Microsoft Word? Was it because you didn’t pay attention to Bobby’s every word? Nope. You mashed together what you want to say with how you want it to look, and then stuffed both into a proprietary format.

Single-Source Publishing

A book, an article, or any document in general consists of three layers of information:

  • Content, which is the flesh of the document, and the reason it exists.
  • Semantic information, or data about the content, such as the author, copyright information, list of figures and side notes.
  • Formatting information, which describes how the content will look, based on the semantic information.

Most word processors don’t distinguish between semantic and formatting information. For instance, we usually use bold to emphasize a phrase. We know bold means emphasized, but word processors have no idea. They know what bold looks like, but not what it means.

Formatting information also changes from one medium to another. For example, the address of a link wouldn’t normally be displayed on a web page, because we can turn any word into a link. A printed book is a different story: nothing is clickable, so the address has to be spelled out for the reader.

DocBook can distinguish between the above-mentioned layers of information, and is able to generate different media outputs from the same DocBook document. So now, instead of working on a Word version for my copy-editor, an HTML version for the Web, and a PDF for my publisher, I can keep a single DocBook version, and tell DocBook what to do with it and how.
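
To make that concrete, here is a minimal Python sketch of the single-source idea. It is not how the DocBook toolkit works internally; the sample paragraph and both render functions are made up for illustration. The only real DocBook pieces are the <para> element and <ulink>, the link element in DocBook 4.x. One source goes in, and the link comes out differently for the web and for print.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

# A made-up one-paragraph source; <ulink url="..."> is the DocBook 4.x link element.
SOURCE = (
    '<para>Read the <ulink url="http://www.docbook.org/">DocBook site</ulink>'
    ' first.</para>'
)


def render_web(para):
    """HTML output: the link text becomes clickable, the URL stays hidden."""
    out = [escape(para.text or "")]
    for child in para:
        label = escape(child.text or "")
        if child.tag == "ulink":
            out.append("<a href=%s>%s</a>" % (quoteattr(child.get("url", "")), label))
        else:
            out.append(label)
        out.append(escape(child.tail or ""))
    return "<p>" + "".join(out) + "</p>"


def render_print(para):
    """Print output: nothing is clickable, so the URL is spelled out."""
    out = [para.text or ""]
    for child in para:
        label = child.text or ""
        if child.tag == "ulink":
            out.append("%s (%s)" % (label, child.get("url", "")))
        else:
            out.append(label)
        out.append(child.tail or "")
    return "".join(out)


para = ET.fromstring(SOURCE)
web_output = render_web(para)
print_output = render_print(para)
print(web_output)
print(print_output)
```

The content was written once. Only the rendering rules differ between the two outputs, which is the whole point of keeping formatting out of the source.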

What is DocBook?

The DocBook site says it “is a schema (available in several languages including RELAX NG, SGML and XML DTDs, and W3C XML Schema) maintained by the DocBook Technical Committee of OASIS.”

To put it in plain English, DocBook is a list of markup elements suitable for writing Documents and Books, hence the name. It’s like HTML, in the sense that it has tags and uses plain text files, but unlike HTML, it’s designed for writing books. DocBook has no idea what your final document will look like. That’s the job of the DocBook toolkit.

When we want to start a new paragraph in HTML, we wrap it in a <p> tag, as in <p>paragraph</p>. The browser knows how to display a paragraph tag: it knows that it should insert a margin above and below. DocBook, following the same concept, uses a <para> tag.
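
If you have never seen DocBook markup, the sketch below shows roughly what a tiny document looks like, and reads it back with Python’s standard XML parser. The article text is invented, and the snippet leaves out the DOCTYPE declaration a real DocBook 4.x file would carry, so it is only checked for well-formedness, not validated.

```python
import xml.etree.ElementTree as ET

# An invented, minimal DocBook-style article: a title and two paragraphs.
DOCUMENT = """\
<article>
  <title>The Smallest DocBook Article</title>
  <para>DocBook marks up what text <emphasis>means</emphasis>.</para>
  <para>How it looks is decided later, by the toolkit.</para>
</article>
"""

root = ET.fromstring(DOCUMENT)
title = root.findtext("title")
# itertext() joins a paragraph's own text with the text of its inline children.
paragraphs = ["".join(p.itertext()) for p in root.iter("para")]

print(title)
for text in paragraphs:
    print("-", text)
```

Notice that nothing in the document says “bold”, “italic” or “12pt”. The <emphasis> element only says that a word matters; what that looks like is left to the toolkit.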

DocBook relies on a toolkit to read DocBook files and generate final documents. A toolkit has tools to validate DocBook files, apply formatting, and import images, all according to a set of rules defined in stylesheets.

Schema Jungle

The first version of DocBook was based on SGML. SGML is the mother of most of the markup languages we use. HTML started out as an application of SGML, XML is a simplified subset of it, and XML-based languages such as SVG and MathML are its grandchildren.

Later on, the DocBook people realized that SGML isn’t a very practical option. Well, it works, but it complicates authors’ lives. So an XML version was created. For an author, the main difference between the SGML and XML versions is the toolkit, and in most cases the two are interchangeable.

When you start reading more about DocBook, you’ll be bumping into all sorts of geeky acronyms. RELAX NG, DTD and XML Schema are the most frequent ones. The bad news is that these three look nothing like each other, and they’re mentioned all the time. The good news is it’s none of your business. You don’t need to learn any of them. I promise.

RELAX NG, DTD and XML Schema are three different ways of expressing the same thing: the list of valid DocBook elements and where each one may appear. These different versions exist for technical reasons, for toolkit developers and DocBook validators. As an author, you should be basking at Starbucks, crafting your masterpiece, instead of worrying about TLAs (Three Letter Acronyms).
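
If you are curious what “a list of valid elements” amounts to, here is a toy Python sketch. The ALLOWED_CHILDREN table is a hand-picked illustration, not the real DocBook content model, and find_violations is a made-up helper. Real validation is done by a validator reading the actual DTD or RELAX NG schema, and it also checks things this sketch ignores, such as element order and attributes.

```python
import xml.etree.ElementTree as ET

# Toy "schema": for each known element, the child elements it may contain.
# Hand-picked for illustration; this is NOT the real DocBook content model.
ALLOWED_CHILDREN = {
    "article": {"title", "para"},
    "title": set(),
    "para": {"emphasis"},
    "emphasis": set(),
}


def find_violations(element):
    """Return messages for elements the toy schema does not allow."""
    if element.tag not in ALLOWED_CHILDREN:
        return ["unknown element <%s>" % element.tag]
    problems = []
    for child in element:
        known = child.tag in ALLOWED_CHILDREN
        if known and child.tag not in ALLOWED_CHILDREN[element.tag]:
            problems.append(
                "<%s> is not allowed inside <%s>" % (child.tag, element.tag)
            )
        else:
            # Either the child is fine (check its own children) or it is
            # unknown (the recursive call reports it).
            problems.extend(find_violations(child))
    return problems


good = ET.fromstring(
    "<article><title>T</title><para>Some <emphasis>text</emphasis></para></article>"
)
bad = ET.fromstring("<article><para>Some <b>bold</b> text</para></article>")

good_problems = find_violations(good)
bad_problems = find_violations(bad)
print(good_problems)
print(bad_problems)
```

A DTD, a RELAX NG schema and a W3C XML Schema each express this kind of table in their own syntax, which is why an author can safely ignore which one the toolkit happens to use.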

DocBook Versions

When DocBook people mention versions, they mean the schema version, not the toolkit version. Different versions of DocBook have different markup elements. Some were added in version 4.0, but deprecated in 5.0. Some exist only in a minor version such as 4.5, but not in 4.1.

It used to be more confusing a couple of years ago, because customization is DocBook’s best and worst feature. O’Reilly, for example, created a subset called DocBookLite, which was suitable for writing their books. Simplified DocBook gets rid of elements related to book-writing, and leaves you with a subset suitable for technical documents. Other schemas extend DocBook and add more markup elements.

The most prominent DocBook customization is Simplified DocBook, which you should take a look at if you’re writing white papers, not books. But the best news is that you only need to know one DocBook version. That’s right, only one. Just one. You don’t have to go out of your way to write a book. Read the Definitive Guide for version 4.x, pay attention to what changed between minor 4.x releases, and you’re set. You can convert 4.x to 5.x later. A year from now I would probably recommend learning version 5.x, but it’s not final yet, so I’d wait.

If you’re eyeing O’Reilly, keep in mind that they’ve deprecated DocBookLite in favor of the full DocBook schema. So make sure you’ve read more about DocBook before you contact them.

The DocBook Toolkit

When it’s time to make a book from your DocBook sources, things tend to get messy. Since DocBook is just an XML file with restrictions, there are dozens upon dozens of different tools you can use. However, there are two main toolkits.

Toolkits are a combination of commonly-used tools, which validate your DocBook sources, apply stylesheets to them, and generate final, publishable versions of your book.

DocBook stylesheets have nothing to do with CSS. These stylesheets describe how the toolkit should convert DocBook elements to the desired output. So in order to convert a DocBook file to HTML, the stylesheet would tell the toolkit to replace every <para> with <p>, and apply other conversions for the rest of the elements. Of course the same applies to a PDF-converting stylesheet, which would take a <para> tag and replace it with a paragraph boundary, sort of.
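
The “replace every <para> with <p>” idea can be sketched in a few lines of Python. This is a toy, not the real DocBook XSL stylesheets, which are written in XSLT and handle far more than renaming. The HTML_RULES table and the apply_rules helper are invented for illustration, and the sketch drops attributes and falls back to <span> for anything it doesn’t know.

```python
import xml.etree.ElementTree as ET

# A toy "stylesheet": which HTML element replaces each DocBook element.
# Illustrative only; the real stylesheets do much more than rename tags.
HTML_RULES = {
    "article": "div",
    "title": "h1",
    "para": "p",
    "emphasis": "em",
}


def apply_rules(element, rules):
    """Return a copy of the tree with every element renamed per the rules."""
    # Unknown elements fall back to <span>; attributes are not copied.
    converted = ET.Element(rules.get(element.tag, "span"))
    converted.text = element.text
    converted.tail = element.tail
    for child in element:
        converted.append(apply_rules(child, rules))
    return converted


source = ET.fromstring(
    "<article><title>Hello</title>"
    "<para>DocBook is <emphasis>semantic</emphasis> markup.</para></article>"
)
html = ET.tostring(apply_rules(source, HTML_RULES), encoding="unicode")
print(html)
```

Swap HTML_RULES for a different table and the same source produces a different output format, without touching the document itself.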

There are two types of stylesheets: DSSSL and XSL, used with the SGML and XML versions respectively. SGML can be converted to XML, and OpenJade, the tool that works with DSSSL, has an XML adapter, so it will work with XML just as easily. If you are a developer though, you’re probably more comfortable with XSLT processors such as xsltproc, Saxon and Xalan than with Jade.

PDF generation seems to be the most troubled part of the whole DocBook process. DocBook’s stylesheets can’t generate PDFs directly. Instead, they generate XSL-FO documents, which are then used by an FO processor to generate PDFs. Explaining FO is outside the scope of this article, but fortunately you only need to read the first paragraph of the Wikipedia entry.
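
The two-step route to a PDF looks roughly like the Python sketch below, which only builds the two command lines and runs them on request. It assumes xsltproc and Apache FOP are installed and on your PATH. The stylesheet path is an assumption too: it is where some Linux distributions install the DocBook XSL stylesheets, and it will differ on your machine. The pdf_commands and build_pdf helpers are invented names.

```python
import subprocess

# ASSUMPTION: location of the DocBook XSL "fo" stylesheet; adjust for your system.
FO_STYLESHEET = "/usr/share/xml/docbook/stylesheet/docbook-xsl/fo/docbook.xsl"


def pdf_commands(source, stylesheet=FO_STYLESHEET):
    """Return the two commands that turn e.g. book.xml into book.pdf."""
    base = source.rsplit(".", 1)[0]
    fo_file, pdf_file = base + ".fo", base + ".pdf"
    return [
        # Step 1: the stylesheet turns DocBook into an XSL-FO document.
        ["xsltproc", "-o", fo_file, stylesheet, source],
        # Step 2: the FO processor turns the XSL-FO document into a PDF.
        ["fop", "-fo", fo_file, "-pdf", pdf_file],
    ]


def build_pdf(source):
    """Run both steps in order; stops with an error if either tool fails."""
    for command in pdf_commands(source):
        subprocess.run(command, check=True)


commands = pdf_commands("book.xml")
for command in commands:
    print(" ".join(command))
```

Calling build_pdf("book.xml") would run both steps. The intermediate book.fo file is the XSL-FO document, and it is the first place to look when the PDF comes out wrong.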

Apache FOP is the most commonly used FO processor, but mostly because it’s free, and competing products are expensive.

Authoring Tools

DocBook has an awful lot of authoring tools, yet most of them require an expert, and they look funny to Windows users (I’m looking at you, Emacs). Of all the editors I looked at, I found the best option to be XMLmind’s XML Editor, XXE for short. There’s a free version, and the commercial license isn’t insanely over-priced. It can be extensively customized, and you can get the source code when you buy a commercial license. Oh, and O’Reilly use it.

[Image: XXE Document Editor]

More Information

I hope this article tickled your fancy, however odd that phrase might sound. If you’d like to learn more about DocBook, try to sift through Mark Johnson’s DocBookmarks, get brain-washed, and dip your toes into the details.



10 Responses

  1. James Watson Oct 2, 2007
    at 7:17pm

    One problem I had with XMLMind was that it doesn’t allow entity references. IMO this removes a really useful feature, especially with respect to technical documentation.

    The other issue with XMLMind was that it doesn’t use any sort of pretty printing in the output and will remove any indentation from a source file that it modifies. This is annoying at the very least.

  2. I haven’t used XMLmind extensively yet, but I’ll look further into it. XMLmind isn’t as extensive as oXygen, for example. But it does what it’s supposed to do to a certain extent.

    Thanks for the tip James, I’ll try to find a workaround for those two issues.

  3. Richard Sullivan Oct 4, 2007
    at 2:32pm

    I evangelized using DocBook in what was until then a Word-only environment. Without XMLMind I don’t think I could have brought other developers with me.

    For batch processing I used the Ant script provided by the aphid-docbook project – see http://sourceforge.net/projects/aphid-docbook/
    and I did some merging and manipulation of documents using Java and Groovy code.

    The aphid-docbook toolchain is probably not exactly “bleeding-edge” but it worked very well.

    I was hoping to embed SVG format graphics in HTML and PDF output but this didn’t work very well (mainly issues with the browsers rather than DocBook or Aphid issues). I found the open-source tools Dia and Inkscape very good but unfortunately had to export to pixel-oriented formats rather than SVG.

    The only area where I wasn’t happy was in how FOP didn’t seem able to do anything reasonable with images that were too large for the page (I tried to get it to simply crop or scale the images but instead it simply threw an exception). Maybe newer versions of FOP can handle this issue – I think the version of FOP bundled with Aphid is quite old but I didn’t have the time or know-how to try to update it.

    Hope this info helps somebody,
    Richard

  4. Tony DaSilva Oct 4, 2007
    at 2:33pm

    Yes, support for entities would be nice, but XMLMind provides similar functionality through its “Copy as reference” command.

    In short, you create a document to contain your global references (product names, versions, etc.), wrap each item in an HTML span or DocBook phrase, give it an ID, and use “Copy as reference” to paste the item wherever you need it.

    This provides several benefits, two of which come to mind immediately:

    1. You can create and maintain these global references using XMLMind.
    2. You can share your XML variables between different documents with ease.

    I use XMLMind exclusively and employ this method all the time and with great success.

    Switching gears for a moment, these sorts of criticisms arise all the time from folks who have not invested much time looking into XMLMind. I regret that many people who would otherwise gain enormous benefits from using XMLMind wind up taking a pass. Curiously enough, I’ve observed over the years that the more experience one has with XML, the less attractive XMLMind seems.

    The makers of XMLMind target their product at writers, not coders or über markup geeks (although many swear by it). Bottom line on XMLMind: if you are creating narrative or topical content in XML, no other editor provides the range of functionality and value that XMLMind does.

    Check out my review of XMLMind to learn more:
    http://www.winwriters.com/articles/xmlmind/index.html

    I do XML (DocBook and DITA, mostly) training and consulting for organizations and individuals looking to XML as a solution to their content development and management challenges. Feel free to contact me at mail”at”antonio-dasilva.com if you’re looking for assistance.

    All the best,
    Tony DaSilva

  5. Richard, I think the best build environment I’ve seen for DocBook is the one with Dive Into Python. Take a look at the sources, Ant tasks and Python scripts. Mark Pilgrim’s done a great job.

    Tony, you’re right, XMLmind targets authors, not geeks. That’s why it’s my choice for writing any DocBook. I don’t want to be bothered with the markup details until the review phase, and XMLmind takes my mind off memorizing DocBook tags.

  6. Maggie Leber Oct 6, 2007
    at 4:16pm

    I have found jEdit ( http://www.jedit.org ) to be a reasonable lightweight tool for working with DocBook. I use the XML support, and build documents to their target formats (usually PDF, HTML, RTF) with Ant scripts.

    I guess that puts me clearly on the “geek” side of things. :-)

    I understand that OpenOffice has support for Simplified DocBook, but confess I haven’t actually used it.
