The Smallest DocBook Big Picture
October 2nd, 2007 • General
So you’re a writer, looking to make loads of money off your next best-seller. You fire up Microsoft Word to spend a few weeks drafting and crafting, carefully formatting your examples, properly aligning your figures and images, “bolding” this and italicizing that. Well into your hundredth page, your publisher contacts you regarding that draft you sent.
“This won’t work”, Bobby annoyingly says, “Didn’t we agree on a different line spacing? Oh, and your tables should be a little narrower, and this, and that, and…”. You sigh, knowing that you will be spending the next few days on coffee and cheap nicotine.
What happened there? Was it because you used Microsoft Word? Was it because you didn’t pay attention to Bobby’s every word? Nope. You mashed together what you want to say with how you want it to look, and then stuffed both into a proprietary format.
A book, an article, or any document in general consists of three layers of information:
- Content, which is the flesh of the document, and the reason it exists.
- Semantic information, or data about the content, such as the author, copyright information, list of figures and side notes.
- Formatting information, which describes how the content will look according to the semantic information.
Most text processors don’t distinguish between semantic and formatting information. For instance, we usually use bold to emphasize a phrase. We know bold means emphasized, but text processors have no idea. They know what bold looks like, but not what it means.
Formatting information also changes from one medium to another. For example, the address of a link wouldn’t normally be displayed on a web page, because we can turn any word into a link. Asking a reader to visit a website from a printed book is a different story: the address has to be spelled out.
DocBook can distinguish between these layers of information, and it can generate output for different media from the same DocBook document. So now, instead of working on a Word version for my copy-editor, an HTML version for the Web, and a PDF for my publisher, I can keep a single DocBook version and tell DocBook what to do with it and how.
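Here is a toy illustration of that idea in Python: one piece of semantic content (a link) and two presentations of it. The Link class, both render functions and the example address are made up for this sketch. None of it is part of DocBook, it only shows why keeping meaning apart from formatting pays off.

```python
from dataclasses import dataclass
from html import escape


@dataclass
class Link:
    """Semantic information only: what the link says and where it points."""
    text: str
    url: str


def render_html(link: Link) -> str:
    # On the web the address can hide behind the linked words.
    return f'<a href="{escape(link.url, quote=True)}">{escape(link.text)}</a>'


def render_print(link: Link) -> str:
    # On paper nothing is clickable, so the address is spelled out.
    return f"{link.text} ({link.url})"


# Example content; the address is only an illustration.
link = Link(text="the DocBook site", url="http://www.docbook.org/")
print(render_html(link))
print(render_print(link))
```

The content is written once, and each medium decides for itself how a link should look.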
What is DocBook?
The DocBook site says it “is a schema (available in several languages including RELAX NG, SGML and XML DTDs, and W3C XML Schema) maintained by the DocBook Technical Committee of OASIS.”
To put it in plain English, DocBook is a list of markup elements suitable for writing Documents and Books, hence the name. It’s like HTML in the sense that it has tags and uses plain text files, but unlike HTML, it’s designed for writing books. DocBook has no idea what your final document will look like. That’s the job of the DocBook toolkit.
When we want to start a new paragraph in HTML, we use a <p>paragraph</p> tag. The browser knows how to display a paragraph tag: it knows that it should insert a margin above and below. DocBook, following the same concept, uses a <para> tag.
DocBook relies on a toolkit to read DocBook files and generate final documents. A toolkit has tools to validate DocBook files, apply formatting and import images, according to a set of rules defined in stylesheets.
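To make the markup a little more concrete, here is a minimal sketch of what a DocBook source can look like, read with Python’s standard XML parser. The sample text is invented, and a real file would normally also carry an XML declaration and a DOCTYPE, which are left out here to keep things short. The elements used (article, title, para, emphasis) are ordinary DocBook elements.

```python
import xml.etree.ElementTree as ET

# A tiny DocBook article. Notice there is nothing here about fonts,
# margins or line spacing: only content and what it means.
DOCBOOK_SOURCE = """\
<article>
  <title>The Smallest DocBook Article</title>
  <para>This is a paragraph with an <emphasis>emphasized</emphasis> phrase.</para>
  <para>This is a second paragraph.</para>
</article>
"""

article = ET.fromstring(DOCBOOK_SOURCE)

# Print the title, then the plain text of every paragraph.
print(article.find("title").text)
for para in article.findall("para"):
    print("".join(para.itertext()))
```

Since the source is plain XML, any XML-aware tool can read it, which is exactly what the toolkit builds on.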
The first version of DocBook was based on SGML, the mother of most of the markup languages we use. HTML was originally defined as an SGML application, and XML is a simplified subset of SGML on which SVG, MathML and many other markups are built.
Later on, the DocBook people realized that SGML wasn’t a very practical option. Well, it is, but it complicates authors’ lives. So an XML version was created. For an author, the main difference between the SGML and XML versions is the toolkit, and in most cases the two are interchangeable.
When you start reading more about DocBook, you’ll bump into all sorts of geeky acronyms. RELAX NG, DTD and XML Schema are the most frequent ones. The bad news is that these three bear no resemblance to each other, and they’re mentioned all the time. The good news is that it’s none of your business. You don’t need to learn any of them. I promise.
RELAX NG, DTD and XML Schema are three different ways of expressing the same thing: the list of valid DocBook elements. These different versions exist for technical reasons, for toolkit developers and DocBook validators. As an author, you should be basking at Starbucks, crafting your masterpiece, instead of worrying about TLAs (Three Letter Acronyms).
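If you’re wondering what “the list of valid DocBook elements” buys you, here is a deliberately naive sketch. It is not how RELAX NG, DTD or XML Schema validation actually works: the allowed set is a tiny hand-picked sample, and the unknown_elements helper is hypothetical. A real schema also describes which elements may appear inside which, something this toy ignores.

```python
import xml.etree.ElementTree as ET

# Hand-picked sample of DocBook element names (an assumption for this demo;
# the real schema defines hundreds of elements plus nesting rules).
ALLOWED_ELEMENTS = {"article", "title", "para", "emphasis"}


def unknown_elements(source: str) -> list:
    """Return the sorted names of elements that are not in the allowed set."""
    root = ET.fromstring(source)
    return sorted({el.tag for el in root.iter() if el.tag not in ALLOWED_ELEMENTS})


good = "<article><title>T</title><para>Fine.</para></article>"
bad = "<article><p>HTML habits</p><b>die hard</b></article>"

print(unknown_elements(good))
print(unknown_elements(bad))
```

A validator does this job properly, so mistakes like a stray HTML tag get caught before the toolkit tries to build your book.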
When DocBook people mention versions, they mean the schema version, not the toolkit version. Different versions of DocBook have different markup elements. Some elements were added in version 4.0 but deprecated in 5.0. Some exist only in a later minor version such as 4.5, but not in 4.1.
It used to be more confusing a couple of years ago, because customization is DocBook’s best and worst feature. O’Reilly, for example, created a subset called DocBookLite, which was suitable for writing their books. Simplified DocBook gets rid of elements related to book-writing, and leaves you with a subset suitable for technical documents. Other schemas extend DocBook and add more markup elements.
The most prominent DocBook customization is Simplified DocBook, which you should take a look at if you’re writing white papers rather than books. But the best news is that you only need to know one DocBook version. That’s right, only one. Just one. You don’t have to go out of your way to write a book. Read the Definitive Guide for version 4.x, pay attention to what changed between minor 4.x releases, and you’re set. You can convert 4.x to 5.x later. A year from now I would probably recommend learning version 5.x, but it’s not final yet, so I’d wait.
If you’re eyeing O’Reilly, keep in mind that they’ve deprecated DocBookLite in favor of the full DocBook schema. So make sure you’ve read more about DocBook before you contact them.
The DocBook Toolkit
When it’s time to make a book from your DocBook sources, things tend to get messy. Since a DocBook document is just an XML file with restrictions, there are dozens upon dozens of different tools you can use. However, there are two main toolkits.
Toolkits are a combination of commonly-used tools, which validate your DocBook sources, apply stylesheets to them, and generate final, publishable versions of your book.
DocBook stylesheets have nothing to do with CSS. These stylesheets describe how the toolkit should convert DocBook elements to the desired output. So in order to convert a DocBook file to HTML, the stylesheet would tell the toolkit to replace every <para> with <p>, and apply other conversions for the rest of the elements. Of course the same applies to a PDF-converting stylesheet, which would take a <para> tag and replace it with a paragraph boundary, sort of.
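The conversion idea can be sketched in a few lines of Python. To be clear, this is not how the DocBook stylesheets are written (they are DSSSL or XSL, and far more elaborate). The HTML_TAG_FOR mapping and the to_html function are hypothetical, and the fallback to span for unmapped elements is my own choice for the demo.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from a few DocBook elements to HTML elements.
HTML_TAG_FOR = {"article": "div", "title": "h1", "para": "p", "emphasis": "em"}


def to_html(element: ET.Element) -> ET.Element:
    """Rebuild the tree with HTML tag names, keeping all the text in place."""
    html = ET.Element(HTML_TAG_FOR.get(element.tag, "span"))
    html.text = element.text
    html.tail = element.tail
    for child in element:
        html.append(to_html(child))
    return html


source = ET.fromstring(
    "<article><title>Hello</title>"
    "<para>Some <emphasis>important</emphasis> claim.</para></article>"
)
result = ET.tostring(to_html(source), encoding="unicode")
print(result)
# <div><h1>Hello</h1><p>Some <em>important</em> claim.</p></div>
```

Swap the mapping for a different one and the same source produces a different output format, which is the whole point of keeping the rules in a stylesheet.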
There are two types of stylesheets: DSSSL and XSL, used with SGML and XML versions respectively. SGML can be converted to XML, and OpenJade, the tool that works with DSSSL, has an XML adapter, so it will work with XML just as easily. If you are a developer though, you’re probably more comfortable with xsltproc, Saxon and Xalan, instead of Jade.
PDF generation seems to be the most troubled part of the whole DocBook process. DocBook’s stylesheets can’t generate PDFs directly. Instead, they generate XSL-FO documents, which are then used by an FO processor to generate PDFs. Explaining FO is outside the scope of this article, but fortunately you only need to read the first paragraph of the Wikipedia entry.
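For the curious, the two-step PDF route can be sketched as a pair of command lines. This assumes xsltproc as the XSLT processor and Apache FOP as the FO processor (this article doesn’t name an FO processor, so FOP is my pick for the example). The pdf_pipeline helper, the file names and the stylesheet path are placeholders, since where the stylesheets live depends on your installation. The function only builds the commands, it doesn’t run anything.

```python
def pdf_pipeline(source, fo_stylesheet, fo_file, pdf_file):
    """Build the two commands: DocBook -> XSL-FO, then XSL-FO -> PDF."""
    # Step 1: apply the FO stylesheet to the DocBook source.
    to_fo = ["xsltproc", "-o", fo_file, fo_stylesheet, source]
    # Step 2: let the FO processor turn the XSL-FO file into a PDF.
    to_pdf = ["fop", "-fo", fo_file, "-pdf", pdf_file]
    return [to_fo, to_pdf]


# Placeholder paths; adjust them to wherever your stylesheets are installed.
commands = pdf_pipeline(
    source="book.xml",
    fo_stylesheet="docbook-xsl/fo/docbook.xsl",
    fo_file="book.fo",
    pdf_file="book.pdf",
)
for command in commands:
    print(" ".join(command))
```

To actually run the steps you could pass each list to subprocess.run(command, check=True), provided both tools are installed.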
DocBook has an awful lot of authoring tools, yet most of them require an expert, and they look funny to Windows users (I’m looking at you, Emacs). Of all the editors I looked at, I found the best option to be XMLmind’s XML Editor, XXE for short. There’s a free version, and the commercial license isn’t insanely over-priced. It can be extensively customized, and you can get the source code when you buy a commercial license. Oh, and O’Reilly uses it.
I hope this article tickled your fancy, however odd that phrase might sound. If you’d like to learn more about DocBook, try to sift through Mark Johnson’s DocBookmarks, get brain-washed, and dip your toes into the details.