Re: The Reference Concrete Syntax is not Current Practice (Was Re: Standards, Work Groups, and Reali

Daniel W. Connolly (connolly@beach.w3.org)
Mon, 25 Sep 95 17:40:24 EDT
In message <Pine.3.89.9509242308.A25843-0100000@alpha>, Arjun Ray writes:
>
>Can the Working Group make a definitive statement about the practicality
>of a "real SGML parser" contending with the *current practice* of HTML?

Err... I intend to write something up about this.

The HTML 2.0 spec defers to the SGML spec on lexical issues, but it
does a certain amount of handwaving about "parsed token sequences,"
a term that is not really used or defined in the SGML spec.

I'm working on a collection of materials to get the lexical stuff
in HTML cleared up once and for all: things like a conformance
test suite, reference code, and some sort of paper to explain this
stuff.

I expect the gist of the paper to be:

HTML 2.0 lexical syntax is the same as for "Basic SGML Documents"
(SGML std section 15.1.1), with the following exceptions
as application conventions:

* SHORTTAG applies only to attributes, not to tags. More
  precisely:
  * no Empty Start-tags: <>, Unclosed Start-tags: <abc<def>,
    nor NET-enabling Start-tags: <em/foo/ (SGML 7.4.1)
  * no Empty End-tags: </>, Unclosed End-tags: </foo<bar>,
    nor Null End-tags: <em/foo/ (SGML 7.5.1)

* no internal document type declaration subset
  (SGML 11.1, productions 110 and 112)

* no marked sections (SGML 10.4)

* no named character references (SGML 9.5)

* increased (essentially ignored) capacities and quantities
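
To make those concrete: here are a few constructs that a Basic SGML
Document would allow but these conventions rule out (the examples are
mine, cooked up for illustration; they're not from the test suite):

    <TITLE/minimized/              NET-enabling start-tag: out
    some text</>                   Empty end-tag: out
    <!DOCTYPE HTML [ <!ENTITY a "b"> ]>
                                   internal declaration subset: out
    <![ CDATA [ <not-a-tag> ]]>    marked section: out
    &#SPACE;                       named character reference: out

Note that numeric character references (&#160;) and entity references
(&amp;) are unaffected, and unquoted attribute values (<A HREF=home>)
stay legal -- that's the attributes-only part of SHORTTAG.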

I hope to hack up SP to emit warnings for these exceptions, and run a
validation service based on this hack (or incentivise someone else to
do the hack... :-).

And I'm working on a matching lex specification, since the grammar in
the SGML specification isn't sufficient to build a parser from: it's
ambiguous, and it mixes all sorts of lexical levels together.

And I hope to build the validation suite collaboratively. For a preview,
take a look at:

http://www.w3.org/pub/WWW/MarkUp/html-test/submission.html

I've been working on some lexical test cases in:
http://www.w3.org/pub/WWW/MarkUp/html-test/lexical/

Inquiring minds might take a look at a preview of the reference
implementation, prototyped in perl:

http://www.w3.org/pub/WWW/MarkUp/html-test/lexical/test.pl
http://www.w3.org/pub/WWW/MarkUp/html-test/lexical/sgmllex.pl
http://www.w3.org/pub/WWW/MarkUp/html-test/lexical/Makefile

>In all the agonizing over content model violations, sight has been lost of
>the far more fundamental fact that current practice has been divorced
>from the Reference Concrete Syntax.

I'd agree it's time to bring this issue to the forefront again.

> HTML *as it is being used in practice*
>poses a *tokenization problem* for any SGML-compliant implementation.
>It's ad hoc parsing all the way, and to expect implementors today to
>consider SGML compliance *at the lexical level* is to ask competitive
>suicide of them, insofar as HTML is taken to *mean* current practice.

Well... I'm not asking the vendors to take HTML to mean current
practice to that extent. I really believe it's easier to deploy
conforming SGML parsers (or something close enough, given the above
simplifications) than it would be to get everybody to agree on a
formalism for the whole of current practice.

Look at the little sad-face on Arena. It's only a little bit of code
to add. And I hope to give out a piece of code that folks will find
suitably re-usable :-)

>Does the Working Group have an estimate of the percentage of existing
>documents that can "conform" *only* to the parsing heuristic embodied in
>
><URL:
>ftp://ftp.ncsa.uiuc.edu/Mosaic/Unix/source/Mosaic-src/libhtmlw/HTMLparse.c>?

Well... the folks at OpenText could probably answer that question to
about 6 significant digits. My guesstimate is 47.94567%. But a more
important question is: which will get us to reliable interoperability
quicker: getting the document base to be SGML-conforming, or getting
all the deployed software to agree on a specification derived from
HTMLparse.c?

Web documents have a short half-life (those that don't change often
aren't worth much, are they?). Many of them are produced by machine.
Authoring tools are getting easier to use and more conforming.

I suggest we approach this education problem -- which is what it is --
from the top down: educate the implementors of browsers and authoring
tools, and the authors of HTML "how-to" documents. Let the consumer
community learn from them. That's why I started the HTML 2.0
specification effort in the first place.

>And from that standard's perspective, how much of current practice is the
>Working group prepared to declare explicitly as non-conforming?

Well... with the publication of HTML 2.0, we've declared a whole lot
of stuff to be non-conforming. At the lexical level, I don't expect to
budge much. (At the content-model level, we're in need of a clean
extension mechanism.)

>But putting a significant fraction of the existing document base beyond
>the pale isn't half as relevant as the fact that there are implementations
>on which this fraction "works", and for people whose concerns are limited
>to that, what a standard actually stipulates counts for much less than
>the ability of the implementors to simply *claim* conformance.

Let's look at this carefully: it starts with _an_ implementation on
which this fraction works. A single implementation does not make for a
healthy market. Other companies have stepped in. Ask them if they're
happy that they've had to reverse engineer the behaviour of other
browsers. Ask them how much it cost. That cost is passed on. We all
pay.

>The Working Group should seriously consider declaring HTML qua Current
>Practice an unstandardizable hodgepodge.

Nawww... just stick around a while. Things will settle down.

Dan