
Re: HTML/XML parser





On Tue, 10 Jun 1997, Kasper Peeters wrote:

> 
> Hi there,
> 
> Last Saturday during the IRC session I volunteered to write the
> HTML/XML parser for Mnemonic. I've been studying the XML draft since
> then, and investigated various options.
> 
> My proposal is to build a scanner using PCCTS (the Purdue Compiler
> Construction Tool Set). This scanner will only accept well-formed
> documents, which makes the scanner specifications fixed once the HTML
> version is specified or once the XML standard settles. For documents
> that are not well-formed (which matters much more for HTML
> documents), I suggest building a converter that turns them into
> well-formed documents before handing them to the scanner. In this setup,
> any changes in the heuristics used to make sense out of the HTML mess
> on the web will not influence or clutter the real scanner.
> 
> The advantage of using PCCTS over flex/bison is that PCCTS is fully
> C++, and in general easier to maintain. It's freeware, and I have
> successfully compiled it on Linux, SunOS 4.1.2 and HPUX. It should
> also compile with various Windows compilers (not tested by me). It's
> reasonably small (under 250Kb, which includes many examples), and well
> documented (there's even a book, which is also available as a PS file).
> Moreover, I already did a reasonably big project with it.
> 
<cut>
> 
> Kasper
> 
Hmm, another package to install... I keep installing things just to be
able to use Mnemonic, and I don't think that is good...

Another remark: are you going to parse the HTML twice? Once to correct
it and once with PCCTS? (You can only correct it if you know the
structure.) And the amount of correct HTML is VERY low. Once you're able
to correct it, you already know so much of the structure that you can
also generate the tags...
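
A minimal sketch of this point, assuming Python and a deliberately naive
tag matcher. The name `repair`, the regular expression and the event
tuples are all made up for illustration; nothing here is PCCTS or
Mnemonic code. To close unclosed tags and drop stray ones, the repair
step has to keep a stack of open elements, so by the time it has produced
well-formed text it has also seen every open/close/text event that a
second parser would report:

```python
import re

# Naive tag matcher (an assumption for this sketch): attributes are
# matched but thrown away, and comments, entities and empty elements
# such as <br> are not handled at all.
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>")


def repair(html):
    """Return (well_formed_text, events) for a sloppy HTML fragment.

    Unclosed tags are closed, stray closing tags are dropped.
    """
    out, events, stack = [], [], []
    pos = 0
    for m in TAG.finditer(html):
        text = html[pos:m.start()]
        if text:
            out.append(text)
            events.append(("text", text))
        pos = m.end()
        closing = m.group(1) == "/"
        name = m.group(2).lower()
        if not closing:
            stack.append(name)
            out.append("<%s>" % name)
            events.append(("open", name))
        elif name in stack:
            # Close everything opened after `name`, then `name` itself.
            while True:
                top = stack.pop()
                out.append("</%s>" % top)
                events.append(("close", top))
                if top == name:
                    break
        # Otherwise: a stray closing tag with no matching opener; drop it.
    tail = html[pos:]
    if tail:
        out.append(tail)
        events.append(("text", tail))
    # Close whatever is still open at the end of the input.
    while stack:
        top = stack.pop()
        out.append("</%s>" % top)
        events.append(("close", top))
    return "".join(out), events


fixed, events = repair("<b>bold <i>both</b> tail")
print(fixed)
print(events)
```

The `events` list is exactly what a strict second pass over `fixed` would
rediscover, which is why the second parse looks redundant to me.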

Max