X-Hobby: low-tide clam sexing
Date: Tue, 16 Jun 1998 15:15:36 -0500
To: sgml-hl7@dudley.mc.duke.edu
From: "W. Eliot Kimber"
Subject: RE: XML Question on : Unqualified Element Names
Archive-URI: http://www.mcis.duke.edu/standards/HL7/archive/sgml-hl7/19980616161831RE_XML_Question_on__Unqualified_Element_Names_.txt

At 10:56 AM 6/16/98 -0700, Biron,Paul V wrote:
>In the olden days (of SGML, prior to XML) there was no such thing as an
>unqualified element name. Why, because every document had to have a
>DTD, and among other things, the DTD served to "qualify" the element
>names.

DTDs do not qualify names--documents qualify names. It is a common (and
understandable) misconception that DTDs have any value in terms of
semantic definition. They do not. Proof of this is the fact that you can
remove the DTD declarations from a well-formed document and still
process it just fine.

By this I mean: it is the fact that an SGML or XML document is a
separate storage object, separate from all possible other documents,
that gives you a name space of element types. The presence or absence of
a set of declarations for those element types is irrelevant.

>With the advent of XML and its notion of well-formedness, there need not
>be a DTD anymore. One major anticipated use of XML, is to allow a
>system to compose a document out of "parts" of other documents (the
>source document may have DTD's).

It may be an anticipated use, but it's not actually a very good, useful,
or recommended use if by "compose a document" you mean "syntactically
combine documents". The reason is that when you combine documents
syntactically then all the combined bits must, as a whole, meet the
syntactic requirements for a single document. Not only is this
impossible in the general case but it's completely unnecessary and leads
to ridiculous things like the name space proposal.

The processing of documents happens *after* parsing.
It's no more difficult to process a set of related documents than it is
to process a single document. Therefore, there's no need to create a
single document from multiple documents *before* parsing. By doing the
combining *after* parsing you avoid all issues of syntactic combination,
including the need to distinguish elements from different name spaces,
because you haven't removed the original document boundaries, which
defined the name space distinctions in the first place.

Or said another way: name spaces fail to solve a problem that doesn't
need solving.

>How do we combine both of these into a single document and still be able
>to distinquish the different usages of ? With the techniques in
>the W3C "Namespaces" draft recommendation
>(http://www.w3c.org/TR/1998/WD-xml-names-19980327).

But in fact name spaces do nothing of the sort: they simply serve to
make names unique--they tell you nothing about the meaning of those
names. The meaning of the names must still be determined by some
governing schema, which name spaces don't provide. There is, for
example, no guarantee that two elements with names from different name
spaces are not in fact representing the same semantic.

Or said another way: because documents are name spaces, you don't need
anything like the name-space mechanism to ensure that names within
documents are unique.

Besides, name spaces have the very serious problem that they impose
names on the authors of documents, which is in direct conflict with one
of the basic principles of SGML (and, one presumes, XML): the owner of a
document has complete control of that document.

Name spaces also have the problem that you only get one name per
element. If you want two, you're forced to either forgo the second or
appeal to some other mechanism.
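
As a rough illustration of the earlier point that documents can be
combined after parsing rather than before, here is a minimal Python
sketch. The two documents, their element names, and the `load` and
`names_by_document` helpers are all hypothetical, invented for this
illustration; they do not come from the original exchange. Each document
is parsed on its own, and the document boundary alone is what tells the
two uses of `name` apart.

```python
import xml.etree.ElementTree as ET

# Two hypothetical well-formed documents that both use the element type
# "name", with different meanings. Neither one has a DTD.
PATIENT_DOC = "<patient><name>Pat Doe</name></patient>"
DRUG_DOC = "<drug><name>Aspirin</name></drug>"


def load(documents):
    """Parse each document separately, keyed by a caller-chosen label."""
    return {label: ET.fromstring(text) for label, text in documents.items()}


def names_by_document(parsed):
    """Collect <name> content per document; the document is the qualifier."""
    return {
        label: [el.text for el in root.iter("name")]
        for label, root in parsed.items()
    }


# Combination happens here, after parsing, on the parsed trees.
parsed = load({"patient": PATIENT_DOC, "drug": DRUG_DOC})
result = names_by_document(parsed)
print(result)
```

Because the trees are never merged into one syntactic document, no
renaming or prefixing of `name` is needed in this sketch to keep the two
usages distinct.
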
Thus, even if we grant that name spaces *do* provide a way to map
elements to semantic objects, most data modeling and classification
systems I'm aware of allow the same object to exist in multiple
taxonomies at once (I am at once a human, a male, a US citizen, and a
driver of MG automobiles), so it seems silly at best to use a mechanism
that can only ever allow one mapping.

To sum up:

1. There is no need to qualify names within documents because they are
   already qualified by the existence of the document itself.

2. The taxonomic classification power of name spaces is so limited as to
   be applicable to, at best, a very narrow range of applications.

Thus, it's hard to see how name spaces are even worth the effort of
discussing, much less defining or using.

A moment's reflection on the issue of multiple classification should
make it clear that you cannot do it in any reasonable way using element
type names. That leaves only two obvious alternatives:

1. Use attributes to associate elements with semantic objects within
   schemas

2. Impose classification onto elements by pointing at them.

The second solution, while useful in some situations, is too costly for
general use. By contrast, the first solution is simple, easy, and
natural.

Solution 1 is what SGML architectures do. SGML architectures are
explicitly a formal mechanism for associating documents with one or more
governing schemas and then mapping individual elements to objects in
those schemas. The mechanism completely avoids issues of imposition of
names onto documents because it doesn't affect element type names at all
and provides facilities for mapping local attribute names to
architecturally-defined names. In fact, the names used in the document
are completely irrelevant to the processing of the document in terms of
its governing architectures because of the name mapping.

The use of architectures has two key components:

1. A formal declaration of the governing architecture (schema) as a
   named object in its own right (and not just as a set of
   declarations).

2. A formal, syntactic mapping of elements to named objects in the
   governing schema.

Given these two pieces of information you can know everything you need
to know about the meaning of a given element (limited only by the
completeness of the specification and documentation of the governing
architectures).

For example, say I have a database of people and I want to relate my
person objects to the schema that defines the base properties of people
and also to a schema that defines the base properties of employees. I
might do something like this:

... ...

The parts of this document are:

[1] -- Formal declarations of the architectures (schemas) the document
is governed by. The declaration has three parts (there are more you can
have):

- The name attribute provides a local name for the schema. This name is
  used as an attribute of elements to define mapping to objects in the
  schema.

- The public-id attribute specifies the name for the architecture
  (schema) as an object. It represents the entire set of rules and
  constraints defined by the schema. It can be expected to map to all
  the documentation and formal specifications there are, including
  formal schemas in UML, IDL, EXPRESS, etc. Because it is a named
  object, architectures also define name spaces of semantic objects.
  Thus, you get the same qualification effect you get from name spaces,
  plus you get explicit semantic mapping.

- The dtd-public-id attribute gives the location (file name) of the
  XML-syntax DTD declarations for the architecture. These declarations
  enable the XML-specific processing of the document. They are not
  required for the document to be processed in general but do enable
  architecture-level XML validation.

These two declarations indicate that the document is formally governed
by two external schemas: one for people objects and one for employee
objects.
The document is also governed by the unnamed schema of the document
itself, but there is no way to formally associate a document to its
schema (reference to an external DTD subset doesn't do it for a number
of reasons that should be obvious given my statements at the start of
this post).

[2] -- The start of the document instance. Note the "people" and
"employees" attributes. These attributes indicate that the Population
element is both a "set-of-people" and a "set-of-employees", where these
two object types will be defined by their respective schemas. In the
schema defined by the document itself, the element type "Population"
represents a population of beings of various types (that is, living
things of one sort or another). Of course, I could define a third
architecture for beings and declare it as well--this would be the
appropriate way to formally define the schema for the document's element
types.

[3] -- A being element. It is both a person and an employee. There might
be other types of beings. Again, note the "people" and "employees"
attributes used to define the mapping.

[4] -- An element from the Employees schema (but not from the people
schema) that relates two employees together to establish a
manager-to-worker relationship. Note that this element is also
classified within the XLink schema, which defines fundamental element
types for representing hyperlinks. Because XLink cannot impose element
type names on documents, it cannot depend on the use of name spaces and
therefore is, by necessity, an architecture as well. [I have not
formally defined the use of the XLink architecture because the XLink
spec does not currently require it, although I personally think it
should.]

Note that I could not construct this example using name spaces. At best,
I would have to choose one of the four schemas at work here (the
document's base schema plus the three architectures) to impose on the
element type names.
But why bother when I'll have to use elements for everything else?

Note also that I can add new schemas (that is, new classifications) at
any time simply by adding the necessary mappings, which I can do either
by modifying the start tags of the affected elements or by adding the
DTD declarations necessary to define the mappings.

Finally, I can use the same syntactic mechanism to relate one
architecture to another in order to define taxonomic hierarchies. For
example, say I want to formally define my "beings" schema and then
define the "people" schema in terms of it. I would first define the
"beings" schema, which I would do by documenting it in whatever ways I
felt like (prose, formal schemas in UML, IDL, EXPRESS, etc.) and,
normally, by creating some DTD declarations that define how the objects
in the schema should be represented syntactically in XML:

Now we can define the persons architectural DTD:

Note that these declarations make it clear that a person is a kind of
being. The person element must have an ID attribute because beings must
have ID attributes because beings have identity within a population and
"set-of-people" is a population of beings. These constraints could be
formally defined using a constraint language like EXPRESS for example.

Given these new declarations I could rework the original example like so
without loss of information:

... ...

Here I've reduced all the element type names to one- or two-character
strings, yet my ability to interpret the data as a population of beings,
as a set of people, as a set of employees, and as an XLink document is
completely unaffected (in part because the mapping to beings is now
through the mapping to people).

I could make the mapping to beings explicit without indirection through
the people architecture like so:

]> ... ...

Here I've added two ATTLIST declarations with fixed attributes to define
the mapping of the X and Z elements to their corresponding beings
semantics.
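
The claim that element type names stop mattering once the mapping is
carried by attributes can be sketched in Python. Everything in this
sketch is an assumption made for illustration: the sample instance, the
element names, the attribute values (`person`, `employee`, `being`,
`population`), and the `view` helper are hypothetical. It is not the
original example markup and not a real architecture engine; a real one
would also honor fixed attributes declared in the DTD, which this sketch
does not read.

```python
import xml.etree.ElementTree as ET

# Hypothetical instance: element type names are arbitrary short strings.
# Attributes named after each architecture's local name carry the mapping.
DOC = """
<X people="set-of-people" employees="set-of-employees" beings="population">
  <Z id="b1" people="person" employees="employee" beings="being"/>
  <Z id="b2" people="person" beings="being"/>
  <Q id="b3" beings="being"/>
</X>
""".strip()


def view(root, architecture):
    """Return (architectural form, id) pairs for one architecture.

    Only the attribute named after the architecture is consulted;
    the local element type name is never looked at.
    """
    return [
        (el.get(architecture), el.get("id"))
        for el in root.iter()
        if el.get(architecture) is not None
    ]


root = ET.fromstring(DOC)
people_view = view(root, "people")
employees_view = view(root, "employees")
beings_view = view(root, "beings")
print(people_view, employees_view, beings_view, sep="\n")
```

Renaming `X`, `Z`, or `Q` in the sample leaves all three views
unchanged, and one element can appear in several views at once, which is
the multiple-classification property the attribute approach is meant to
provide.
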
Fixing attributes in the declarations this way is a form of markup
minimization that is convenient when a mapping is constant for a given
element type. I could just as easily have added the attributes to the
instance.

Note something interesting about this ability to completely change the
element type names: it means I *could* syntactically combine two
documents into a new one if I wanted to because I can use arbitrary
element type names to avoid name collisions. Not that I would. But I
could.

Cheers,

Eliot
--
W. Eliot Kimber, Senior Consulting SGML Engineer
ISOGEN International Corp.
2200 N. Lamar St., Suite 230, Dallas, TX 95202. 214.953.0004
www.isogen.com