The Reference Concrete Syntax is not Current Practice (Was Re: Standards, Work Groups, and Reality Checks)

Arjun Ray (aray@pipeline.com)
Mon, 25 Sep 95 01:27:30 EDT

On Sat, 23 Sep 1995, Glenn Adams wrote:

> Date: Sat, 23 Sep 95 14:50:36 EDT
> From: amanda@intercon.com
>
> > So all I have to do to *totally* derail the standards process is put a
> > new tag (or change an old one) in a popular browser and fail to file a
> > DTD on it?
>
> Well, that's what the evidence so far seems to indicate, yes.
>
> I would not agree with the previously quoted statement. The handling
> of unknown tags is clearly specified by the current draft -- ignore them.
> Of course, this is a bit easier said than done in the context of using
> a real SGML parser -- it is possible though without any great difficulty.

Can the Working Group make a definitive statement about the practicality
of a "real SGML parser" contending with the *current practice* of HTML?

> On the other hand, the draft is weak on what to do with known tags that
> appear in contexts other than where they are permitted. It is this latter
> problem that is much more insidious.

I have trawled untold megabytes of the mail archives and have yet to find
a good discussion of the *really* insidious problem. I'm nearly convinced
that the SGML gestalt actively hinders its appreciation, by conflating
parsing with validation as a practical prescription, so that *categories*
of error fail to be distinguished.

Two cases prototypically off the point:

> Take for instance the CENTER tag
> employed by Netscape. They failed utterly to specify the content model
> or the context where this element is to be used, and, consequently, many
> documents use it willy-nilly and depend on the quirky parsing of Netscape
> to essentially treat format related tags as a toggle on a global formatting
> state. [...]
>
> What troubles me further, is the fact that Netscape glibly accepts things
> like:
>
> plain <B> bold <I> bold italic </B> bold?!? </I> plain
>
> This is madness!

And one much closer:

> If people are so concerned over the randomness of netscape security key seeds,
> then shouldn't they be concerned with the fact that the following document,
> due to someone forgetting to type a single '>' character, may end up
> killing someone!
>
> <title>Instructions for Patient Jane Doe</title>
> <p>
> <i>Warnings</i>
> <p>
> <b Do not inject this patient with penicillin. She will die!</b>

Consider a different "broken" version of this:

<title>Instructions for Patient Jane Doe</title>
<p>
<!-- Warnings here ---->
<i>Warnings</i>
<p>
<b> Do not inject this patient with penicillin. She will die!</b>
<!---- End Warnings -->

Consider the fact that using Netscape (or Mosaic) achieves the *desired*
result (the warning being displayed), but using an SGML-compliant browser
*could be fatal*. It will take just one such incident to "convince" a
hospital administrator which browser -- and perhaps which *language* -- is
"better", and he'll have a fatality to prove it, standards notwithstanding.

The fact of the matter is that *current practice* deems an arbitrary
number of -'s in comment declarations perfectly acceptable as a prettifying
device. Current practice deems that after STAGO, the first occurrence of
ISO 8859-1 code point #62 (">") is TAGC regardless of context, and therefore
it's permissible to omit an ending quote for the last attribute value
literal in a start-tag. And so on.
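
To make the comment case concrete, here is a minimal sketch in Python of
the two readings applied to the "broken" document above. It is an
illustration only: the function names are hypothetical, the lenient scan is
an assumption about what the ad hoc parsers effectively do (it is not taken
from any browser source), and the strict scan models only the rule that a
comment declaration is "<!", then zero or more "--"-delimited comments with
optional white space after each, then ">".

```python
DOC = """<title>Instructions for Patient Jane Doe</title>
<p>
<!-- Warnings here ---->
<i>Warnings</i>
<p>
<b> Do not inject this patient with penicillin. She will die!</b>
<!---- End Warnings -->
"""


def strip_comments_lenient(text: str) -> str:
    """Assumed browser heuristic: a comment runs from "<!--" to the next "-->"."""
    out = []
    i = 0
    while True:
        start = text.find("<!--", i)
        if start == -1:
            out.append(text[i:])
            break
        out.append(text[i:start])
        end = text.find("-->", start + 4)
        if end == -1:
            break  # unterminated comment: drop the rest
        i = end + 3
    return "".join(out)


def strip_comments_strict(text: str) -> str:
    """Simplified SGML reading: "--" toggles comments inside one declaration."""
    out = []
    i = 0
    while i < len(text):
        if not text.startswith("<!--", i):
            out.append(text[i])
            i += 1
            continue
        j = i + 2  # just past "<!"
        while True:
            if text.startswith("--", j):
                close = text.find("--", j + 2)  # end of this comment
                if close == -1:
                    return "".join(out)  # never closed: rest is swallowed
                j = close + 2
                while j < len(text) and text[j].isspace():
                    j += 1
            else:
                # A ">" here ends the declaration. Anything else is an error
                # in SGML; this sketch just skips ahead to the next ">".
                gt = text.find(">", j)
                if gt == -1:
                    return "".join(out)
                j = gt + 1
                break
        i = j
    return "".join(out)


if __name__ == "__main__":
    print("lenient keeps warning:", "penicillin" in strip_comments_lenient(DOC))
    print("strict keeps warning: ", "penicillin" in strip_comments_strict(DOC))
```

Under the strict reading, the extra "--" in the first declaration opens a
second comment, which does not close until the first "--" of the later
"<!----", so the warning is consumed as comment text.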

In all the agonizing over content model violations, sight has been lost of
the far more fundamental fact that current practice has been divorced
from the Reference Concrete Syntax. HTML *as it is being used in practice*
poses a *tokenization problem* for any SGML-compliant implementation.
It's ad hoc parsing all the way, and to expect implementors today to
consider SGML compliance *at the lexical level* is to ask competitive
suicide of them, insofar as HTML is taken to *mean* current practice.
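
The start-tag case is the same kind of lexical disagreement. The sketch
below, again in Python with hypothetical names and a made-up sample, locates
the end of a start-tag two ways: by taking the first ">" regardless of
context, and by treating ">" inside a quoted attribute value literal as
data. The second function is a simplification of what a conforming parser
does (it ignores literal-length limits and error reporting).

```python
# Hypothetical sample: both anchors omit the closing quote on href.
SAMPLE = '<a href="one.html>first</a> <a href="two.html>second</a>'


def tag_end_lenient(text: str, start: int) -> int:
    """Index just past the first '>' after start, ignoring quotes."""
    gt = text.find(">", start)
    return -1 if gt == -1 else gt + 1


def tag_end_strict(text: str, start: int) -> int:
    """Index just past the '>' that closes the tag, honoring quoted literals."""
    quote = None  # the quote character of the literal we are inside, if any
    for k in range(start + 1, len(text)):
        ch = text[k]
        if quote:
            if ch == quote:
                quote = None  # literal closed
        elif ch in "\"'":
            quote = ch  # literal opened
        elif ch == ">":
            return k + 1
    return -1  # no tag close found


if __name__ == "__main__":
    print("lenient tag:", SAMPLE[:tag_end_lenient(SAMPLE, 0)])
    print("strict tag: ", SAMPLE[:tag_end_strict(SAMPLE, 0)])
```

With this sample the lenient scan yields the tag the author presumably
meant, while the quote-aware scan reads everything up to the second
quotation mark as the value of href, so the first link's text disappears
into the tag.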

Does the Working Group have an estimate of the percentage of existing
documents that can "conform" *only* to the parsing heuristic embodied in

<URL:ftp://ftp.ncsa.uiuc.edu/Mosaic/Unix/source/Mosaic-src/libhtmlw/HTMLparse.c>?

Is the Working Group prepared to make an explicit, definitive statement
about the compliance of this source code with the Reference Concrete
Syntax in, say,

<URL:ftp://ftp.ifi.uio.no/pub/SGML/productions>?

Will the Working Group concede that much of current practice conforms, if
at all, to implementations and not to specifications, far less a standard?
And from that standard's perspective, how much of current practice is the
Working Group prepared to declare explicitly as non-conforming?

But putting a significant fraction of the existing document base beyond
the pale isn't half as relevant as the fact that there are implementations
on which this fraction "works", and for people whose concerns are limited
to that, what a standard actually stipulates counts for much less than
the ability of the implementors to simply *claim* conformance.

And *that* is the fundamental issue. There are players in the game who
seek legitimation only. Will it be a meet outcome if all the good work of
this Working Group has the only substantive effect of legitimizing a
*name* for the benefit of those who, having secured the legitimacy and
cachet that comes from a putative association with an Internet Standard,
propose to ignore the *real* specifications?

The Working Group should seriously consider declaring HTML qua Current
Practice an unstandardizable hodgepodge. Leave it to Netscape, and
perhaps Microsoft, to have the wit or patience or discipline to concoct
specifications the IETF might accept. Delegitimize the name HTML, and
continue the good work!

Arjun Ray
(I speak for myself only.)