
HTML/XML parser




Hi there,

Last Saturday during the IRC session I volunteered to write the
HTML/XML parser for Mnemonic. I've been studying the XML draft since
then, and investigated various options.

My proposal is to build a scanner using PCCTS (the Purdue Compiler
Construction Tool Set). This scanner will only accept well-formed
documents, so its specification stays fixed once the HTML version is
chosen or once the XML standard settles. For documents that are not
well-formed (a far more common case for HTML documents), I suggest
building a converter that turns them into well-formed documents
before handing them to the scanner. In this setup, any changes in the
heuristics used to make sense out of the HTML mess on the web will
not influence or clutter the real scanner.
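
A rough sketch of this two-stage setup, in plain C++ rather than
PCCTS grammar notation, might look like the code below. All names in
it (Token, tokenize, isWellFormed, repair) are hypothetical and not
part of Mnemonic or PCCTS. It only handles bare <name> and </name>
tags, and the repair heuristic shown (close whatever is still open,
drop stray end tags) is just one possible choice.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical token type: a start tag, an end tag, or character data.
struct Token {
    enum Kind { Start, End, Text } kind;
    std::string value;  // tag name, or the text itself
};

// Split the input into tokens. Attributes, comments and entities are
// deliberately left out to keep the sketch short.
std::vector<Token> tokenize(const std::string& in) {
    std::vector<Token> out;
    std::size_t i = 0;
    while (i < in.size()) {
        if (in[i] == '<') {
            std::size_t close = in.find('>', i);
            if (close == std::string::npos) {  // stray '<': keep as text
                out.push_back({Token::Text, in.substr(i)});
                break;
            }
            bool isEnd = (i + 1 < close && in[i + 1] == '/');
            std::size_t nameStart = i + (isEnd ? 2 : 1);
            out.push_back({isEnd ? Token::End : Token::Start,
                           in.substr(nameStart, close - nameStart)});
            i = close + 1;
        } else {
            std::size_t next = in.find('<', i);
            if (next == std::string::npos) next = in.size();
            out.push_back({Token::Text, in.substr(i, next - i)});
            i = next;
        }
    }
    return out;
}

// Stage 2, the strict scanner: accepts only properly nested tags.
bool isWellFormed(const std::vector<Token>& tokens) {
    std::vector<std::string> open;
    for (const Token& t : tokens) {
        if (t.kind == Token::Start) {
            open.push_back(t.value);
        } else if (t.kind == Token::End) {
            if (open.empty() || open.back() != t.value) return false;
            open.pop_back();
        }
    }
    return open.empty();
}

// Stage 1, the converter: all heuristics live here, not in the scanner.
std::vector<Token> repair(const std::vector<Token>& tokens) {
    std::vector<Token> out;
    std::vector<std::string> open;
    for (const Token& t : tokens) {
        if (t.kind == Token::End) {
            // Drop an end tag that matches no open element.
            if (std::find(open.begin(), open.end(), t.value) == open.end())
                continue;
            // Close every element opened after the matching one.
            while (open.back() != t.value) {
                out.push_back({Token::End, open.back()});
                open.pop_back();
            }
            open.pop_back();
        } else if (t.kind == Token::Start) {
            open.push_back(t.value);
        }
        out.push_back(t);
    }
    // Close whatever is still open at end of input.
    while (!open.empty()) {
        out.push_back({Token::End, open.back()});
        open.pop_back();
    }
    assert(isWellFormed(out));  // the converter's contract with stage 2
    return out;
}
```

With this split, input such as "<p>one<b>two</p>" is rejected by
isWellFormed as it stands, but accepted after repair has inserted the
missing </b>. Changing the heuristics only touches repair.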

The advantage of using PCCTS over flex/bison is that PCCTS is fully
C++, and in general easier to maintain. It's freeware, and I have
successfully compiled it on Linux, SunOS 4.1.2 and HP-UX. It should
also compile with various Windows compilers (not tested by me). It's
reasonably small (under 250 KB, which includes many examples), and
well documented (there's even a book, which is also available as a PS
file). Moreover, I have already done a reasonably big project with it.

I have considered using SP (an SGML parser by James Clark, see the
SGML database for URLs), but it has a few disadvantages. One is that
I'm not very familiar with its internals, so it would be hard to make
any extensions if necessary. Moreover, it is a full SGML parser, which
is a bit too much for our needs, and therefore probably bigger than
necessary (the source archive is around 500 KB, and it takes half an
hour or so to compile on my 90 MHz Pentium with 16 MB, though this
includes some example programs).

We could also code our own scanner/parser by hand. This has the
disadvantage that it takes a lot of time to do it right. This
solution does allow us to construct a parser that handles both
well-formed and non-well-formed documents in the same code, but
for the reason given above (the heuristics would clutter the real
scanner) I do not think that is a smart thing to do.

If no one complains within a day or two, I'll start coding. If anyone
has references to XML texts that are not available at w3.org or
through the SGML database but are worth reading, I'd like to know
(tutorials, examples and the like). Of course, if you want to join me
(marcellus@... ?), you're welcome.

I'll post some proposals on the internal tree structure in a few
days; I have to study the existing code in some more detail first.

This text, together with other proposals and links to e.g. PCCTS, will
be stored at http://www.pvda.nl/~kasper/web/mnemonic/ .

Kasper