From: enag@ifi.uio.no (Erik Naggum)
Date: Tue, 15 Feb 1994 15:27:43 +0100
Message-Id: <19940215134758.3289.gyda.ifi.uio.no@ifi.uio.no>
Subject: Re: to be deleted.

[Eliot Kimber] (1994-02-14 23:02:36 -0500)

| I'm afraid that on this point there can be no compromise.  If a
| document is an SGML document then it *must* start with a DOCTYPE
| declaration and include the document element *in the same entity*.  The
| definition of SGML document entity is quite clear on this.  In fact, it
| is impossible to know whether or not a given stream of data is valid
| SGML *unless* there is a doctype declaration (and an SGML declaration,
| which may be implied by the processing system).

I beg to differ.

This is a complicated issue, and I'm at work now, so I can't elaborate
until sometime tonight (or this afternoon, EST), but the reason it has
become complicated is that there has been a general failure to understand
the distinction between what an SGML parser will see, and what was really
there.

Charles Goldfarb and I quickly came to the conclusion that the record
boundary characters were figments of the SGML entity manager's interface,
and we have worked hard to specify a mechanism that allows "storage
objects" (a generalization of "file") to identify their record boundary
convention (a generalization of "line terminator") such that the entity
manager could do the right thing with them.

We also took this argument further, realizing that the "entity" is not a
file, or a string of characters "out there".  It's a string of characters
as seen by the parser.  We allowed substrings of storage objects and
concatenation of (substrings of) storage objects, and the reason we have
"storage object" instead of "file" is the realization that the user needs
the ability to identify the "storage manager" that can take a "storage
object specification" and convert it to a string of characters.
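
To make the entity manager's role concrete, here is a minimal Python
sketch of the mechanism described above.  It is an illustration only, not
the specification under discussion: the class, function, and convention
names are all hypothetical, the two manager names are the defaults named
in this message, and the RS/RE code points are assumed to be those of the
SGML reference concrete syntax.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# Record start / record end as the parser sees them. The code points
# (LF and CR) follow the SGML reference concrete syntax; treat them as
# an assumption of this sketch.
RS, RE = "\n", "\r"

# Hypothetical names for the record boundary conventions a storage
# object may declare.
TERMINATORS: Dict[str, str] = {"lf": "\n", "crlf": "\r\n", "cr": "\r"}


@dataclass
class StorageObject:
    manager: str                 # which storage manager resolves `spec`
    spec: str                    # manager-specific specification
    records: str = "lf"          # record boundary convention of the data
    start: int = 0               # optional substring of the storage object
    end: Optional[int] = None


def read_file(spec: str) -> str:
    # newline="" keeps the stored line terminators untranslated.
    with open(spec, newline="") as f:
        return f.read()


def read_literal(spec: str) -> str:
    return spec


STORAGE_MANAGERS: Dict[str, Callable[[str], str]] = {
    "file": read_file,
    "literal": read_literal,
}


def entity_text(parts: List[StorageObject]) -> str:
    """Concatenate (substrings of) storage objects into one entity."""
    out: List[str] = []
    for part in parts:
        raw = STORAGE_MANAGERS[part.manager](part.spec)[part.start:part.end]
        records = raw.split(TERMINATORS[part.records])
        if len(records) > 1 and records[-1] == "":
            records.pop()        # a trailing terminator closes the last record
        out.extend(RS + record + RE for record in records)
    return "".join(out)


parts = [
    StorageObject("literal", "<p>one\r\ntwo\r\n", records="crlf"),
    StorageObject("literal", "XXthree</p>", start=2),
]
text = entity_text(parts)
```

The parser only ever sees `text`: each record is delimited by RS and RE
no matter how the underlying data marked its line ends.  A user-supplied
storage manager would simply be one more entry in `STORAGE_MANAGERS`.
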
We provide two default storage managers: "file", and "literal".  A user
can thereby provide his own storage manager to read text from an in-memory
buffer, from a network resource, from a database, from the execution of a
program, etc.  Conceptually, there is no limit to the number of
transformations that could be applied to the storage objects before they
were presented to the parser as the string of characters of an entity.

An extension of this idea was the realization that people work with one
particular document type much more than they work with others.  It would
be a waste to parse the same DTD thousands of times a day, and we got the
idea that a pre-parsed DTD could be stored in some way transparent to the
user, which would be used by the SGML parser.

This folds itself neatly into the idea of a resumable parser that stores
enough state information that it can resume from any point in the parsing
process.  Right after the DTD parsing is just one example.  The idea was
that a parser client could parse up to a certain point, keep a "bookmark",
and resume parsing from there if the text following this point changed.
Well, apply this to a DTD, and the whole instance could change.

I believe I have outlined a standards-conforming process that can be used
to support the initial view that HTML+ need not include a DTD in every
file (a reference would suffice), and need not parse the DTD itself (a
pre-parsed version will do).  Since the HTML+ application is restrictive,
the number of document type declarations that will conform to the
application is small, and can, for all practical uses, be limited to one,
which is the one that all HTML+ processors implement.

However, this is not really such a big deal.  I have argued that
validating a DTD is a different task than using it, and some parsers
implement this distinction.  Validating is _hard_.
Parsing a DTD in order to use it is relatively easy, and takes almost the
same amount of resources as reading and processing a binary format
resulting from the pre-parsed DTD.  (Barring tons of comments in the DTD,
or lots of small files with DTD fragments.)

If we also assume that HTML+ document authors validate their documents
before they ship them (a not unfriendly requirement when you consider the
alternative), parsing relative to a DTD is a relatively simple process.
If done with something other than an SGML parser, however, it can be
expensive, hard to get right, and terribly complicated.  Therefore, an
SGML parser should be used for this purpose.  Whatever it is that actually
does the job will be an "SGML parser", although probably not a
_conforming_ SGML parser.

Using a publicly available tool that can communicate with its client in
the way I have outlined should offer some significant advantages.  I also
believe using POEM would solve many problems in retrieving files over the
network, and would thus simplify the entire parsing process.

Well, duty calls.  I will have to continue later.

Best regards,
--
Erik Naggum                                     |  Memento, terrigena.
ISO 8879 SGML, ISO 10744 HyTime, ISO 10646 UCS  |  Memento, vita brevis.