
Re: DTD code




> Would someone be willing to give me a brief overview of the way the DTD
> code will connect with the parser and the rest of the system?

Aw'right, here we go:

HTML has the `nice' feature that elements do not always have to be
closed explicitly, eg.

   <ul>
   <li>one
   <li>two
   </ul>

is perfectly fine; you do not need the </li> end tags. This is defined in
the DTD for HTML:

  <!ELEMENT LI - O (%flow;)*             -- list item -->

The `-' means that the start tag is required (more on that in a
second), the `O' means that the end tag (ie. </li>) is
optional. Moreover, this ELEMENT definition tells us which elements
can exist as children of <li>: the elements defined by the entity
`flow'. This entity is defined as a composition of the `block' and
`inline' entities, each of which groups a number of elements. For instance,
elements in the `block' group are <center>, <hr>, <div>, <dl> etc. The
important thing for our example is that <li> is NOT in the entity
`flow'. So <li> elements cannot be children of <li> elements.

So based on this information, the parser can conclude that

  1. the second <li> implicitly closes the first one, so it 
     really reads
 
            <ul>
            <li>one</li>
            <li>two
            </ul>

  2. The </ul> also closes the second <li>, resulting in

            <ul>
            <li>one</li>
            <li>two</li>
            </ul>

This, then, is a nice tree and can be represented in-memory by a DOM
tree.
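
A minimal sketch of the two steps above, assuming a toy rule table that
covers only <ul> and <li>. The names ElementRule, Dtd, toyDtd, Token and
normalize are made up for this illustration and are not taken from
xmlparser.cc or htmldtd.cc; attributes and error handling are left out.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical, heavily simplified DTD entry: which elements may appear as
// direct children, and whether the end tag may be omitted (the `O' flag).
struct ElementRule {
    std::set<std::string> children;
    bool endTagOptional;
};

using Dtd = std::map<std::string, ElementRule>;

// Toy rules for the <ul>/<li> example only; "#text" stands for character data.
Dtd toyDtd() {
    Dtd dtd;
    dtd["ul"] = ElementRule{{"li"}, false};
    dtd["li"] = ElementRule{{"#text", "ul"}, true};
    return dtd;
}

struct Token {
    enum Kind { Open, Close, Text } kind;
    std::string value;  // element name, or the text itself
};

// Rewrites a token stream so that every omitted end tag becomes explicit.
std::string normalize(const Dtd& dtd, const std::vector<Token>& tokens) {
    std::vector<std::string> open;  // stack of currently open elements
    std::string out;
    auto closeTop = [&] {
        assert(!open.empty());
        out += "</" + open.back() + ">";
        open.pop_back();
    };
    auto allows = [&](const std::string& parent, const std::string& child) {
        auto it = dtd.find(parent);
        return it != dtd.end() && it->second.children.count(child) > 0;
    };
    auto optionalEnd = [&](const std::string& name) {
        auto it = dtd.find(name);
        return it != dtd.end() && it->second.endTagOptional;
    };
    for (const Token& t : tokens) {
        switch (t.kind) {
        case Token::Open:
            // Step 1: a start tag closes open elements that cannot contain it,
            // as long as their end tag is optional.
            while (!open.empty() && !allows(open.back(), t.value) &&
                   optionalEnd(open.back()))
                closeTop();
            open.push_back(t.value);
            out += "<" + t.value + ">";
            break;
        case Token::Text:
            out += t.value;
            break;
        case Token::Close:
            // Step 2: an end tag first closes nested elements whose own end
            // tag is optional, then closes the matching element.
            while (!open.empty() && open.back() != t.value &&
                   optionalEnd(open.back()))
                closeTop();
            if (!open.empty() && open.back() == t.value)
                closeTop();
            break;
        }
    }
    return out;
}
```

Feeding it the tokens for <ul><li>one<li>two</ul> yields the fully closed
form shown in step 2.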

There are some other rules in the DTD, but this is essentially all
there is to know. The most important additional thing is `implicit
open tags', which basically means that you can leave out <head>/</head>
and just do <title>...</title>; the parser will insert the <head>
element around the title automatically. You can see how I represented
these rules in xmlparser.hh/xmlparser.cc and in particular htmldtd.cc.
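
Implicit open tags can be sketched in the same spirit. The table and the
function names below (impliedParent, impliedOpens) are assumptions made
for this example and do not reflect how htmldtd.cc stores the rules. The
idea is only that each element may name a parent that gets opened
automatically when it is missing.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical table: element -> parent that is opened implicitly when the
// element shows up outside of it.
const std::map<std::string, std::string>& impliedParent() {
    static const std::map<std::string, std::string> table = {
        {"head", "html"}, {"body", "html"}, {"title", "head"},
    };
    return table;
}

// Returns the start tags the parser would insert (outermost first) before
// `name` can be opened, given the stack of currently open elements.
std::vector<std::string> impliedOpens(const std::vector<std::string>& open,
                                      const std::string& name) {
    std::vector<std::string> missing;
    std::string current = name;
    for (;;) {
        auto it = impliedParent().find(current);
        if (it == impliedParent().end())
            break;  // no implied parent: nothing more to insert
        const std::string parent = it->second;
        if (!open.empty() && open.back() == parent)
            break;  // the required parent is already the innermost element
        missing.insert(missing.begin(), parent);
        // The table must not contain cycles, or this loop would never end.
        assert(missing.size() <= impliedParent().size());
        current = parent;
    }
    return missing;
}
```

With nothing open yet, a <title> start tag gives {"html", "head"}; with
<html> already open it gives just {"head"}.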

The document type declaration at the very top of an HTML file will tell
you the precise DTD that applies, i.e.

   <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">

refers us to the HTML 3.2 DTD. For the example above there is of
course no difference between HTML 3.2 and 4.0. This declaration can
be used by the parser to request the proper DTD from the net. The DTD
would have to be parsed and converted into an internal structure as I
have it (handcoded) in htmldtd.cc right now.
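
As a rough sketch of that first step, the public identifier can be pulled
out of the declaration and mapped to a known DTD. The function names and
the KnownDtd enum are invented for this example; a real parser would have
to handle SYSTEM identifiers, single quotes and so on, and would fetch and
parse the DTD instead of picking from a fixed list.

```cpp
#include <cassert>
#include <string>

// Returns the public identifier of a DOCTYPE declaration (the first quoted
// string after the PUBLIC keyword), or an empty string if there is none.
std::string publicIdentifier(const std::string& decl) {
    const std::string keyword = "PUBLIC";
    std::size_t pos = decl.find(keyword);
    if (pos == std::string::npos) return "";
    std::size_t start = decl.find('"', pos + keyword.size());
    if (start == std::string::npos) return "";
    std::size_t end = decl.find('"', start + 1);
    if (end == std::string::npos) return "";
    assert(end > start);
    return decl.substr(start + 1, end - start - 1);
}

// Hypothetical selector: maps a declaration to one of the DTDs we know about.
enum class KnownDtd { Html32, Html40, Unknown };

KnownDtd selectDtd(const std::string& decl) {
    const std::string id = publicIdentifier(decl);
    if (id == "-//W3C//DTD HTML 3.2//EN") return KnownDtd::Html32;
    if (id == "-//W3C//DTD HTML 4.0//EN") return KnownDtd::Html40;
    return KnownDtd::Unknown;
}
```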

Additional parts of the DTD can be `mixed' with the HTML code, like

  <!DOCTYPE kasper [
    <!ELEMENT kasper - - (UL | DL)>
  ]>

which defines a new element <kasper> which can contain a <ul> or a
<dl>. This could be added `on the fly' to the DTD that is being used.


For HTML, because of these implicit open and close tags, the parser
_needs_ the DTD to create the correct tree structure. Without it,
there is no way to convert the example at the beginning of this email
into the correct tree.

For XML, the W3C folks decided that this was too complicated, so XML
does NOT allow implicit tags. You would have to write the example as

  <ul>
  <li>one</li>
  <li>two</li>
  </ul>

Moreover, you cannot use a bare <br> anymore (this element is `empty' in
HTML, so it never takes a </br>). Instead, you write <br/>. So you CAN
convert an XML document into a tree without any knowledge of the
DTD. For XML, the DTD just tells you which elements can occur as
children of other elements. So it can still tell us that <li> is not
allowed as a child of <li>, but that information is only in the DTD to
enable parsers to _validate_ your document. The tree structure is
unambiguous without the DTD.
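
To illustrate, here is a minimal sketch of a tree builder that needs
nothing but a stack of open elements. It assumes attribute-free input
with only start tags, end tags, <name/> empty tags and text; Node and
parseXml are names made up for this example, and comments, entities,
processing instructions and the like are ignored.

```cpp
#include <cassert>
#include <memory>
#include <stdexcept>
#include <string>
#include <vector>

struct Node {
    std::string name;  // element name, "#text" or "#document"
    std::string text;  // character data, used by "#text" nodes only
    std::vector<std::unique_ptr<Node>> children;
};

// Builds a tree using only a stack of open elements; no DTD is consulted.
std::unique_ptr<Node> parseXml(const std::string& src) {
    std::unique_ptr<Node> root(new Node());
    root->name = "#document";
    std::vector<Node*> open;  // innermost open element is open.back()
    open.push_back(root.get());
    std::size_t i = 0;
    while (i < src.size()) {
        assert(!open.empty());
        if (src[i] != '<') {  // character data up to the next tag
            std::size_t end = src.find('<', i);
            if (end == std::string::npos) end = src.size();
            std::unique_ptr<Node> text(new Node());
            text->name = "#text";
            text->text = src.substr(i, end - i);
            open.back()->children.push_back(std::move(text));
            i = end;
            continue;
        }
        std::size_t end = src.find('>', i);
        if (end == std::string::npos)
            throw std::runtime_error("unterminated tag");
        std::string tag = src.substr(i + 1, end - i - 1);
        i = end + 1;
        if (!tag.empty() && tag[0] == '/') {
            // An end tag must match the innermost open element exactly.
            if (open.size() < 2 || open.back()->name != tag.substr(1))
                throw std::runtime_error("mismatched end tag");
            open.pop_back();
        } else {
            bool empty = !tag.empty() && tag[tag.size() - 1] == '/';
            if (empty) tag.erase(tag.size() - 1);  // <br/> has no content
            std::unique_ptr<Node> element(new Node());
            element->name = tag;
            Node* raw = element.get();
            open.back()->children.push_back(std::move(element));
            if (!empty) open.push_back(raw);
        }
    }
    if (open.size() != 1)
        throw std::runtime_error("unclosed element");
    return root;
}
```

The fully closed <ul> example parses into a <ul> node with two <li>
children, while the HTML-style version with the omitted </li> tags is
rejected with a mismatched end tag error, since without a DTD there is
nothing that says the second <li> should close the first.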



Summary: for HTML, a DTD is _required_ to build the tree. In _general_
a DTD is required when you want to detect (and so avoid) illegal
nesting of elements.

Since we do not want to bother the layout engine with illegally nested
DOM trees (ie. a <li> occurring as a child of <li>) it is very useful
to have the DTD information available. This way, the parser can make
sure that the layout engine won't go bananas about a DOM tree that it
cannot understand.

Hope that helps. Chris, let me know if you need technical details (we
could also do an old-style irc chat if you prefer). Note, by the way,
that at the moment, for immediate results, a CSS parser is more
urgently needed than a DTD parser.

Kasper
-

         -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
         Mnemonic Browser Project - http://www.mnemonic.browser.org/
                           Developers Mailinglist
             Archive: http://www.mnemonic.browser.org/list/dev/
         -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-