Adapted from Goldfarb, Charles F., "A Generalized Approach to Document Markup", SIGPLAN Notices, June 1981

Introduction to Generalized Markup

A.1 The Markup Process

Text processing and word processing systems typically require additional information to be interspersed among the natural text of the document being processed. This added information, called "markup", serves two purposes:

  1. Separating the logical elements of the document; and
  2. Specifying the processing functions to be performed on those elements.

In publishing systems, where formatting can be quite complex, the markup is usually done directly by the user, who has been specially trained for the task. In word processors, the formatters typically have less function, so the (more limited) markup can be generated without conscious effort by the user. As higher function printers become available at lower cost, however, the office workstation will have to provide more of the functionality of a publishing system, and "unconscious" markup will be possible for only a portion of office word processing.

It is therefore important to consider how the user of a high function system marks up a document. There are three distinct steps, although he may not perceive them as such.

  1. He first analyses the information structure and other attributes of the document; that is, he identifies each meaningful separate element, and characterizes it as a paragraph, heading, ordered list, footnote, or some other element type.
  2. He then determines, from memory or a style book, the processing instructions ("controls") that will produce the format desired for that type of element.
  3. Finally, he inserts the chosen controls into the text.

Here is how the first paragraph of this paper looks when marked up with controls in a typical text processing formatting language:

.SK 1
Text processing and word processing systems 
typically require additional information to 
be interspersed among the natural text of 
the document being processed.  This added 
information, called "markup", serves two 
purposes:
.TB 4
.OF 4
.SK 1
1.#Separating the logical elements of the 
document; and
.OF 4
.SK 1
2.#Specifying the processing functions to be 
performed on those elements.
.OF 0
.SK 1

The .SK, .TB, and .OF controls, respectively, cause the skipping of vertical space, the setting of a tab stop, and the offset, or "hanging indent", style of formatting. (The number sign (#) in each list item represents a tab code, which would otherwise not be visible.)

Procedural markup like this, however, has a number of disadvantages. For one thing, information about the document's attributes is usually lost. If the user decides, for example, to center both headings and figure captions when formatting, the "center" control will not indicate whether the text on which it operates is a heading or a caption. Therefore, if he wishes to use the document in an information retrieval application, search programs will be unable to distinguish headings -- which might be very significant in information content -- from the text of anything else that was centered.

Procedural markup is also inflexible. If the user decides to change the style of his document (perhaps because he is using a different output device), he will need to repeat the markup process to reflect the changes. This will prevent him, for example, from producing double-spaced draft copies on an inexpensive computer line printer while still obtaining a high quality finished copy on an expensive photocomposer. And if he wishes to seek competitive bids for the typesetting of his document, he will be restricted to those vendors that use the identical text processing system, unless he is willing to pay the cost of repeating the markup process.

Moreover, markup with control words can be time-consuming, error-prone, and require a high degree of operator training when complex typographic results are desired. This is true (albeit less so) even when a system allows defined procedures ("macros"), since these must be added to the user's vocabulary of primitive controls. The elegant and powerful TeX system[2], for example, which is widely used for mathematical typesetting, includes some 300 primitive controls and macros in its basic implementation.

These disadvantages of procedural markup are avoided by a markup scheme due to C. F. Goldfarb, E. J. Mosher, and R. A. Lorie[3,4]. It is called "generalized markup" because it does not restrict documents to a single application, formatting style, or processing system. Generalized markup is based on two postulates:

  1. Markup should describe a document's structure and other attributes rather than specify processing to be performed on it, as descriptive markup need be done only once and will suffice for all future processing.
  2. Markup should be rigorous so that the techniques available for rigorously-defined objects like programs and data bases can be used for processing documents as well.

These postulates will be developed intuitively by examining the properties of this type of markup.

A.2 Descriptive Markup

With generalized markup, the markup process stops at the first step: the user locates each significant element of the document and marks it with the mnemonic name ("generic identifier") that he feels best characterizes it. The processing system associates the markup with processing instructions in a manner that will be described shortly.

A notation for generalized markup, known as the Standard Generalized Markup Language (SGML), has been developed by a Working Group of the International Organization for Standardization. Marked up in SGML, the start of this paper might look like this:

<p>
Text processing and word processing systems 
typically require additional information to 
be interspersed among the natural text of 
the document being processed.  This added 
information, called <q>markup</q>, serves 
two purposes:
<ol>
<li>Separating the logical elements of 
the document; and
<li>Specifying the processing functions 
to be performed on those elements.
</ol>

Each generic identifier (GI) is delimited by a less-than symbol (<) if it is at the start of an element, or by a less-than followed by solidus (</) if it is at the end. A greater-than symbol (>) separates a GI from any text that follows it. The mnemonics P, Q, OL, and LI stand, respectively, for the element types paragraph, quotation, ordered list, and list item. The combination of the GI and its delimiters is called a "start-tag" or an "end-tag", depending upon whether it identifies the start or the end of an element.

This example has some interesting properties:

  1. There are no quotation marks in the text; the processing for the quotation element generates them and will distinguish between opening and closing quotation marks if the output device permits.
  2. The comma that follows the quotation element is not actually part of it. Here, it was left outside the quotation marks during formatting, but it could just as easily have been brought inside were that style preferred.
  3. There are no sequence numbers for the ordered list items; they are generated during formatting.

The source text, in other words, contains only information; characters whose only role is to enhance the presentation are generated during processing.
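The delimiter conventions described above also make the markup easy to recognize mechanically. The following sketch (in Python, purely illustrative, and ignoring attributes and markup declarations) separates a fragment of the sample into start-tags, end-tags, and text:

import re

# Minimal scanner for the delimiters described above: "<" opens a
# start-tag, "</" opens an end-tag, and ">" closes either.  Attributes
# and markup declarations are ignored for brevity.
TOKEN = re.compile(r'</(\w+)>|<(\w+)>|([^<]+)')

def scan(markup):
    for end_gi, start_gi, text in TOKEN.findall(markup):
        if end_gi:
            yield 'end-tag', end_gi
        elif start_gi:
            yield 'start-tag', start_gi
        elif text.strip():
            yield 'text', text.strip()

sample = '<p>This added information, called <q>markup</q>, serves two purposes:'
for kind, value in scan(sample):
    print(kind, value)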

If, as postulated, descriptive markup like this suffices for all processing, it must follow that the processing of a document is a function of its attributes. The way text is composed offers intuitive support for this premise. Such techniques as beginning chapters on a new page, italicizing emphasized phrases, and indenting lists, are employed to assist the reader's comprehension by emphasizing the structural attributes of the document and its elements.

From this analysis, a 3-step model of document processing can be constructed:

  1. Recognition: An attribute of the document is recognized, e.g., an element with a generic identifier of "footnote".
  2. Mapping: The attribute is associated with a processing function. The footnote GI, for example, could be associated with a procedure that prints footnotes at the bottom of the page or one that collects them at the end of the chapter.
  3. Processing: The chosen processing function is executed.

Text formatting programs conform to this model. They recognize such elements as words and sentences, primarily by interpreting spaces and punctuation as implicit markup. Mapping is usually via a branch table. Processing for words typically involves measuring the word's width and testing whether the current line has been overfilled; processing for sentences might cause extra space to be inserted between them.

In the case of low-level elements such as words and sentences the user is normally given little control over the processing, and almost none over the recognition. Some formatters offer more flexibility with respect to higher-level elements like paragraphs, while those with powerful macro languages can go so far as to support descriptive markup. In terms of the document processing model, the advantage of descriptive markup is that it permits the user to define attributes -- and therefore element types -- not known to the formatter and to specify the processing for them.

For example, the SGML sample just described includes the element types "ordered list" and "list item", in addition to the more common "paragraph". Built-in recognition and processing of such elements is unlikely. Instead, each will be recognized by its explicit markup and mapped to a procedure associated with it for the particular processing run. Both the procedure itself and the association with a GI would be expressed in the system's macro language. On other processing runs, or at different times in the same run, the association could be changed. The list items, for example, might be numbered in the body of a book but lettered in an appendix.

So far the discussion has addressed only a single attribute, the generic identifier, whose value characterizes an element's semantic role or purpose. Some descriptive markup schemes refer to markup as "generic coding", because the GI is the only attribute they recognize[5]. In generic coding schemes, recognition, mapping, and processing can be accomplished all at once by the simple device of using GIs as control procedure names. Different formats can then be obtained from the same markup by invoking a different set of homonymous procedures. This approach is effective enough that one notable implementation, the SCRIBE system, is able to prohibit procedural markup completely[1].
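A sketch of this device (in Python, with hypothetical procedure names, and not SCRIBE's actual mechanism) shows how the same list markup can be given two formats simply by looking up each GI in a different set of homonymous procedures, numbering the items in one style and lettering them in the other:

# Sketch: the GI "li" is used directly as a procedure name; two handler
# sets give two formats from the same descriptive markup.  All names
# here are hypothetical.

def li_numbered(text, n):
    return '%d. %s' % (n, text)                         # 1. ..., 2. ...

def li_lettered(text, n):
    return '%s) %s' % (chr(ord('a') + n - 1), text)     # a) ..., b) ...

BODY_STYLE     = {'li': li_numbered}
APPENDIX_STYLE = {'li': li_lettered}

items = ['Separating the logical elements of the document; and',
         'Specifying the processing functions to be performed on those elements.']

def format_list(items, procedures):
    return [procedures['li'](text, n) for n, text in enumerate(items, start=1)]

print(format_list(items, BODY_STYLE))
print(format_list(items, APPENDIX_STYLE))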

Generic coding is a considerable improvement over procedural markup in practical use, but it is conceptually insufficient. Documents are complex objects, and they have other attributes that a markup language must be capable of describing. For example, suppose that the user decides that his document is to include elements of a type called "figure" and that it must be possible to refer to individual figures by name. The markup for a particular figure known as "angelfig" could begin with this start-tag:

<fig id=angelfig>

"Fig", of course, stands for "figure", the value of the generic identifier attribute. The GI identifies the element as a member of a set of elements having the same role. In contrast, the "unique identifier" (ID) attribute distinguishes the element from all others, even those with the same GI. (It was unnecessary to say "GI=fig", as was done for ID, because in SGML it is understood that the first piece of markup for an element is the value of its GI.)

The GI and ID attributes are termed "primary" because every element can have them. There are also "secondary" attributes that are possessed only by certain element types. For example, if the user wanted some of the figures in his document to contain illustrations to be produced by an artist and added to the processed output, he could define an element type of "artwork". Because the size of the externally-generated artwork would be important, he might define artwork elements to have a secondary attribute, "depth". This would result in the following start-tag for a piece of artwork 24 picas deep:

<artwork depth=24p>

The markup for a figure would also have to describe its content. "Content" is, of course, a primary attribute, the one that the secondary attributes of an element describe. The content consists of an arrangement of other elements, each of which in turn may have other elements in its content, and so on until further division is impossible. One way in which SGML differs from generic coding schemes is in the conceptual and notational tools it provides for dealing with this hierarchical structure. These are based on the second generalized markup hypothesis, that markup can be rigorous.
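One way to picture the hierarchy (a sketch only, with hypothetical names rather than SGML terminology) is as a recursive data structure: each element carries its GI, its attributes, and a content list whose members are either character data or further elements. The ordered list from the earlier sample would look like this:

from dataclasses import dataclass, field
from typing import Dict, List, Union

# Sketch of the hierarchy: an element has a generic identifier, its
# attributes (primary ones such as ID, and any secondary ones such as
# "depth"), and content made up of character data and subelements.
@dataclass
class Element:
    gi: str
    attributes: Dict[str, str] = field(default_factory=dict)
    content: List[Union['Element', str]] = field(default_factory=list)

ordered_list = Element('ol', content=[
    Element('li', content=['Separating the logical elements of the document; and']),
    Element('li', content=['Specifying the processing functions to be performed on those elements.']),
])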

A.3 Rigorous Markup

Assume that the content of the figure "angelfig" consists of two elements, a figure body and a figure caption. The figure body in turn contains an artwork element, while the content of the caption is text characters with no explicit markup. The markup for this figure could look like this:

<fig id=angelfig>
<figbody>
<artwork depth=24p>
</artwork>
</figbody>
<figcap>Three Angels Dancing
</figcap>
</fig>

The markup rigorously expresses the hierarchy by identifying the beginning and end of each element in classical left list order. No additional information is needed to interpret the structure, and it would be possible to implement support by the simple scheme of macro invocation discussed earlier. The price of this simplicity, though, is that an end-tag must be present for every element.

This price would be totally unacceptable if the user had to enter all the tags himself. He knows that the start of a paragraph, for example, terminates the previous one, so he would be reluctant to go to the trouble and expense of entering an explicit end-tag for every single paragraph just to share his knowledge with the system. He would have equally strong feelings about other element types he might define himself, if they occurred with any great frequency.

With SGML, however, it is possible to omit much markup by advising the system about the structure and attributes of any type of element the user defines. This is done by creating a "document type definition", using a construct of the language called an "element declaration". While the markup in a document consists of descriptions of individual elements, a document type definition defines the set of all possible valid markup of a type of element.

An element declaration includes a description of the allowable content, normally expressed in a variant of regular expression notation. Suppose, for example, the user extends his definition of "figure" to permit the figure body to contain either artwork or certain kinds of textual elements. The element declaration might look like this:

<!--       ELEMENTS   MIN  CONTENT (EXCEPTIONS) -->
<!ELEMENT  fig        - -  (figbody, figcap?) >
<!ELEMENT  figbody    - O  (artwork | (p | ol | ul)+) >
<!ELEMENT  artwork    - O  EMPTY >
<!ELEMENT  figcap     - O  (#PCDATA) >

The first declaration means that a figure contains a figure body and, optionally, can contain a figure caption following the figure body. (The hyphens will be explained shortly.)

The second says the body can contain either artwork or an intermixed collection of paragraphs, ordered lists, and unordered lists. The "O" in the markup minimization field ("MIN") indicates that the body's end-tag can be omitted when it is unambiguously implied by the start of the following element. The preceding hyphen means that the start-tag cannot be omitted.

The declaration for artwork defines it as having an empty content, as the art will be generated externally and pasted in. As there is no content in the document, there is no need for ending markup.

The final declaration defines a figure caption's content as 0 or more characters. A character is a terminal, incapable of further division. The "O" in the "MIN" field indicates the caption's end-tag can be omitted. In addition to the reasons already given, omission is possible when the end-tag is unambiguously implied by the end-tag of an element that contains the caption.

It is assumed that p, ol, and ul have been defined in other element declarations.
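Because the content models are only a variant of ordinary regular expression notation, they can be checked mechanically. The following sketch (illustrative only, far simpler than a real SGML parser, and ignoring minimization and exceptions) validates the sequence of GIs in an element's content against the declarations above:

import re

# Sketch: the content models above rewritten as ordinary regular
# expressions over a space-terminated sequence of child GIs.  "," is
# sequence, "|" alternation, "?" optional, "+" one-or-more.  EMPTY and
# #PCDATA are handled as special cases; minimization is ignored.
CONTENT_MODELS = {
    'fig':     r'figbody (figcap )?',
    'figbody': r'artwork |(p |ol |ul )+',
    'artwork': r'',                     # EMPTY: no content allowed
    'figcap':  None,                    # #PCDATA: character data only
}

def valid_content(gi, child_gis):
    model = CONTENT_MODELS[gi]
    if model is None:                   # character data, nothing to check
        return True
    sequence = ''.join(child + ' ' for child in child_gis)
    return re.fullmatch(model, sequence) is not None

print(valid_content('fig', ['figbody', 'figcap']))   # True
print(valid_content('fig', ['figcap', 'figbody']))   # False: wrong order
print(valid_content('figbody', ['p', 'ol', 'p']))    # True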

With this formal definition of figure elements available, the following markup for "angelfig" is now acceptable:

<fig id=angelfig>
<figbody>
<artwork depth=24p>
<figcap>Three Angels Dancing
</fig>

There has been a reduction of nearly 40% in markup (from eight tags to five), since the end-tags for three of the elements are no longer needed.

A document type definition also contains an "attribute definition list declaration" for each element that has attributes. The definition includes the possible values the attribute can have, and the default value if the attribute is optional and is not specified in the document.

Here are the attribute list declarations for "figure" and "artwork":

<!--       ELEMENTS   NAME      VALUE      DEFAULT -->
<!ATTLIST  fig        id        ID         #IMPLIED >
<!ATTLIST  artwork    depth     CDATA      #REQUIRED >

The declaration for figure indicates that it can have an ID attribute whose value must be a unique identifier name. The attribute is optional and does not have a default value if not specified.

In contrast, the depth attribute of the artwork element is required. Its value can be any character string.
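Continuing the sketch (hypothetical names again, and checking only for the presence of attributes, not their values), the attribute declarations translate into equally simple tests:

# Sketch: attribute checks corresponding to the declarations above.
# ID is optional (#IMPLIED); depth is required (#REQUIRED) and may be
# any character string (CDATA), so only its presence is tested.
ATTRIBUTE_RULES = {
    'fig':     {'id':    'optional'},
    'artwork': {'depth': 'required'},
}

def valid_attributes(gi, attributes):
    rules = ATTRIBUTE_RULES.get(gi, {})
    if any(name not in rules for name in attributes):
        return False                        # undeclared attribute
    return all(attributes.get(name) is not None
               for name, kind in rules.items() if kind == 'required')

print(valid_attributes('artwork', {'depth': '24p'}))   # True
print(valid_attributes('artwork', {}))                 # False: depth required
print(valid_attributes('fig', {}))                     # True: id is optional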

Document type definitions have uses in addition to markup minimization. They can be used to validate the markup in a document before going to the expense of processing it, or to drive prompting dialogues for users unfamiliar with a document type. For example, a document entry application could read the description of a figure element and invoke a prompting procedure for each element type it contains. The procedures would also enter the markup itself into the document being created.
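Such a dialogue might look like the following sketch (hypothetical prompts, with the structure of a figure written out by hand rather than derived from the declarations as a real application would do):

# Sketch of a document entry dialogue for a figure.  A real application
# would derive the prompts from the element and attribute declarations;
# here the structure is written out by hand.  Minimized markup is
# emitted, as permitted by the declarations above.
def enter_figure():
    fig_id  = input('Unique identifier for the figure: ')
    depth   = input('Depth of the artwork (e.g. 24p): ')
    caption = input('Caption (leave blank for none): ')
    lines = ['<fig id=%s>' % fig_id, '<figbody>', '<artwork depth=%s>' % depth]
    if caption:
        lines.append('<figcap>%s' % caption)
    lines.append('</fig>')
    return '\n'.join(lines)

print(enter_figure())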

The document type definition enables SGML to minimize the user's text entry effort without reliance on a "smart" editing program or word processor. This maximizes the portability of the document because it can be understood and revised by humans using any of the millions of existing "dumb" keyboards. Nonetheless, the type definition and the marked up document together still constitute the rigorously described document that machine processing requires.

A.4 Conclusion

Regardless of the degree of accuracy and flexibility in document description that generalized markup makes possible, the concern of the user who prepares documents for publication is still this: can the Standard Generalized Markup Language, or any descriptive markup scheme, achieve typographical results comparable to procedural markup? A recent publication by Prentice-Hall International[6] represents empirical corroboration of the generalized markup hypothesis in the context of this demanding practical question.

It is a textbook on software development containing hundreds of formulas in a symbolic notation devised by the author. Despite the typographic complexity of the material (many lines, for example, had a dozen or more font changes), no procedural markup was needed anywhere in the text of the book. It was marked up using a language that adhered to the principles of generalized markup but was less flexible and complete than SGML[4].

The available procedures supported only computer output devices, which were adequate for the book's preliminary versions that were used as class notes. No consideration was given to typesetting until the book was accepted for publication, at which point its author balked at the time and effort required to re-keyboard and proofread some 350 complex pages. He began searching for an alternative at the same time the author of this paper sought an experimental subject to validate the applicability of generalized markup to commercial publishing.

In due course both searches were successful, and an unusual project was begun. As the author's processor did not support photocomposers directly, procedures were written that created a source file with procedural markup for a separate typographic composition program. Formatting specifications were provided by the publisher, and no concessions were needed to accommodate the use of generalized markup, despite the marked up document having existed before the specifications.

The experiment was completed on time, and the publisher considers it a complete success[7]. The procedures, with some modification to the formatting style, have found additional use in the production of a variety of in-house publications.

Generalized markup, then, has both practical and academic benefits. In the publishing environment, it reduces the cost of markup, cuts lead times on book production, and offers maximum flexibility from the text data base. In the office, it permits interchange between different kinds of word processors, with varying functional abilities, and allows auxiliary "documents", such as mail log entries, to be derived automatically from the relevant elements of the principal document, such as a memo.

At the same time, SGML's rigorous descriptive markup makes text more accessible for computer analysis. While procedural markup (or no markup at all) leaves a document as a character string that has no form other than that which can be deduced from analysis of the document's meaning, generalized markup reduces a document to a regular expression in a known grammar. This permits established techniques of computational linguistics and compiler design to be applied to natural language processing and other document processing applications.

A.5 Acknowledgments

The author is indebted to E. J. Mosher, R. A. Lorie, T. I. Peterson, and A. J. Symonds -- his colleagues during the early development of generalized markup -- for their many contributions to the ideas presented in this paper, to N. R. Eisenberg for his collaboration in the design and development of the procedures used to validate the applicability of generalized markup to commercial publishing, and to C. B. Jones and Ron Decent for risking their favorite book on some new ideas.

A.6 Bibliography

1. B. K. Reid, "The Scribe Document Specification Language and its Compiler", Proceedings of the International Conference on Research and Trends in Document Preparation Systems, 59-62 (1981).

2. Donald E. Knuth, TAU EPSILON CHI, a system for technical text, American Mathematical Society, Providence, 1979.

3. C. F. Goldfarb, E. J. Mosher, and T. I. Peterson, "An Online System for Integrated Text Processing", Proceedings of the American Society for Information Science, 7, 147-150 (1970).

4. Charles F. Goldfarb, Document Composition Facility Generalized Markup Language: Concepts and Design Guide, Form No. SH20-9188-1, IBM Corporation, White Plains, 1984.

5. Charles Lightfoot, Generic Textual Element Identification -- A Primer, Graphic Communications Computer Association, Arlington, 1979.

6. C. B. Jones, Software Development: A Rigorous Approach, Prentice-Hall International, London, 1980.

7. Ron Decent, personal communication to the author (September 7, 1979).