Re: ISO 2022 versus UTF-8 autodetection heuristics
Markus -
> Bram Moolenaar wrote on 1999-11-03 12:38 UTC:
> > > The ISO 2022 code for announcing UTF-8 is
> > >
> > > ESC %G
> >
> > Hmm, this means that actual characters are used here. The application must
> > know about this, to avoid that they are interpreted as ordinary text
> > characters. That will make it more difficult for older programs, and can
> > break some things. Escape codes can have nasty side effects when sent to a
> > terminal.
>
> There exists a strict syntax for ESC codes specified in ECMA 35 and ECMA
> 48 (ISO 2022 and ISO 6429). This allows applications to reliably jump
> over ESC sequences that they do not know. In a nutshell, an ESC sequence
> starts with ESC and ends with a letter (see the standards for the
> precise details). This is widely implemented in terminal emulators (at
> least in the good ones where the authors read the standards ;-).
Read which standard? This can't be the only one. Why else would there be a
termcap/terminfo database with so many entries?
Anyway, I don't know a single application that ignores these escape sequences.
Try "grep %G" on the file that includes the ESC %G from above.
All programs I know just handle the escape sequences like normal text; they
are neither ignored nor recognized. I didn't try many programs, so perhaps
there is an obvious one that does recognize them.
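To make the difference concrete, here is a minimal sketch of the skipping
rule as I understand it from ECMA 35: ESC, then zero or more intermediate
bytes (0x20-0x2F), then one final byte (0x30-0x7E). The function name
strip_escape_sequences is made up for this example, and the sketch
deliberately ignores the parameter bytes of ECMA 48 control sequences, so
take it as an illustration only, not as what any existing program does.

```python
ESC = 0x1B


def strip_escape_sequences(data: bytes) -> bytes:
    """Drop sequences of the form ESC I...I F (hypothetical helper).

    Only the basic ECMA 35 shape is handled; ECMA 48 control sequences
    with parameter bytes are outside the scope of this sketch.
    """
    out = bytearray()
    i = 0
    n = len(data)
    while i < n:
        if data[i] != ESC:
            out.append(data[i])
            i += 1
            continue
        j = i + 1
        # Zero or more intermediate bytes in the range 0x20..0x2F.
        while j < n and 0x20 <= data[j] <= 0x2F:
            j += 1
        if j < n and 0x30 <= data[j] <= 0x7E:
            # A final byte closes the sequence: skip all of it.
            i = j + 1
        else:
            # Truncated or malformed: pass the ESC through unchanged
            # (a choice made for this sketch).
            out.append(data[i])
            i += 1
    return bytes(out)


sample = b"abc\x1b%Gdef"
print(strip_escape_sequences(sample))  # b'abcdef'
print(b"%G" in sample)                 # True: a plain byte search still sees it
```

The last line is the grep case: a program that searches the raw bytes finds
"%G" as ordinary text unless it filters the data in this way first.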
I would state that these escape sequences are not useful in a file. They
could be useful when communicating with a terminal emulator though. Is there
a termcap/terminfo entry that specifies that the terminal accepts these codes?
> > > The technique that mined98 uses seems to be fairly reliable. In
> > > practice, >98% of all ISO 8859 files contain malformed UTF-8 sequences
> >
> > 98% isn't very reliable. I would aim for 99.9% at least.
>
> I said >98%, not =98%! It is very likely that it works for >99.99% of
> all files. It will certainly detect for >>99.99% of all German
> ISO 8859-1 files that they are obviously not in UTF-8.
Well, why do you say >98% when you really mean >99.9%? :-)
> I challenge you to send me an orthographically correct sentence in one of
> the languages listed in the ISO 8859-1 standard, encoded in Latin-1,
> that does not contain a malformed UTF-8 sequence, i.e. which could not
> trivially be identified as not being UTF-8.
Ah, a challenge! Well, here's one: OCÉ® That's the name of the company I
used to work for with an (R) after it. Almost any name can be followed by an
(R), thus this has quite a big chance of being found in files. Also, ¹²³
are likely to be used to refer to a footnote, which can also appear after many
of the start characters.
Need I continue? Anyway, I have no idea how often these character
combinations occur, but they do exist. With a character set other than
ISO 8859-1 the chance would be different. Perhaps there is a specific set
with a high probability? Perhaps there is some often used Polish word that
happens to be a valid UTF-8 sequence. Hopefully there is an invalid sequence
in the same file to detect that it's not UTF-8 then.
I would say that these sequences do appear, but we don't know how often.
It can still be annoying though. For example, files I made for Océ would
contain plain text (English or Dutch) and that OCÉ® sequence in the footer at
every page. If this is recognized as UTF-8 it causes a mess.
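For anyone who wants to check the bytes: in Latin-1, É is 0xC9 and ® is
0xAE, which UTF-8 reads as a two-byte lead followed by a continuation byte.
Below is a small sketch that tests this. It uses Python's strict UTF-8
decoder as the validity check, which is an assumption of this example and
not necessarily the same test that mined98 applies; is_valid_utf8 is a
made-up helper name.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# The footer from my files, encoded as Latin-1: b'OC\xc9\xae'
footer = "OCÉ®".encode("latin-1")
print(is_valid_utf8(footer))            # True
print(hex(ord(footer.decode("utf-8")[2])))  # 0x26e: C9 AE read as one character

# A footnote mark after a capital with an accent is also accepted.
print(is_valid_utf8("É¹".encode("latin-1")))      # True: C9 B9

# A typical German word is rejected: 0xDF (ß) is a lead byte, but the
# following 'e' is not a continuation byte.
print(is_valid_utf8("Straße".encode("latin-1")))  # False
```

So a file that is plain ASCII apart from that footer passes the check, while
a single word like the German one anywhere in the file is enough to rule
UTF-8 out.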
--
hundred-and-one symptoms of being an internet addict:
81. At social functions you introduce your husband as "my domain server."
--/-/---- Bram Moolenaar ---- Bram@moolenaar.net ---- Bram@vim.org ---\-\--
\ \ www.vim.org/iccf www.moolenaar.net www.vim.org / /
-
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/lists/