Re: ISO 2022 versus UTF-8 autodetection heuristics




Markus -

> Bram Moolenaar wrote on 1999-11-03 12:38 UTC:
> > >   The ISO 2022 code for announcing UTF-8 is
> > > 
> > >     ESC %G
> > 
> > Hmm, this means that actual characters are used here.  The application must
> > know about this, to avoid that they are interpreted as ordinary text
> > characters.  That will make it more difficult for older programs, and can
> > break some things.  Escape codes can have nasty side effects when sent to a
> > terminal.
> 
> There exists a strict syntax for ESC codes specified in ECMA 35 and ECMA
> 48 (ISO 2022 and ISO 6429). This allows applications to reliably jump
> over ESC sequences that they do not know. In a nutshell, an ESC sequence
> starts with ESC and ends with a letter (see the standards for the
> precise details). This is widely implemented in terminal emulators (at
> least in the good ones where the authors read the standards ;-).

Read which standard?  This can't be the only one.  Why else would there be a
termcap/terminfo database with so many entries?

Anyway, I don't know a single application that ignores these escape sequences.
Try "grep %G" on the file that includes the ESC %G from above.
All the programs I know handle the escape sequences like normal text: they
neither ignore nor recognize them.  I didn't try many programs, so perhaps
there is an obvious one that does recognize them.
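
To make the difference concrete, here is a minimal Python sketch of what
"jumping over" an escape sequence would involve, compared with a plain byte
search such as grep performs.  The byte ranges are an assumed, simplified
reading of the ECMA-35 escape sequence and ECMA-48 control sequence syntax,
the names ESC_SEQUENCE and strip_escape_sequences are made up for this
example, and longer constructs such as control strings are not handled.

```python
import re

# Assumed syntax (a simplified reading of ECMA-35 / ECMA-48):
#   control sequence: ESC [ , parameter bytes 0x30-0x3F,
#                     intermediate bytes 0x20-0x2F, final byte 0x40-0x7E
#   escape sequence:  ESC, intermediate bytes 0x20-0x2F, final byte 0x30-0x7E
# The control-sequence alternative comes first so that "ESC [" is not
# mistaken for a complete two-byte escape sequence.
ESC_SEQUENCE = re.compile(
    rb"\x1b\[[\x30-\x3f]*[\x20-\x2f]*[\x40-\x7e]"
    rb"|\x1b[\x20-\x2f]*[\x30-\x7e]"
)

def strip_escape_sequences(data: bytes) -> bytes:
    """Return data with every recognized escape sequence removed."""
    return ESC_SEQUENCE.sub(b"", data)

# A file that starts with the ESC %G announcement.
raw = b"\x1b%GSome text in a file\n"

# A program that treats the file as plain bytes still finds the "%G".
print(b"%G" in raw)                          # True
# A program that skips escape sequences does not.
print(b"%G" in strip_escape_sequences(raw))  # False
```

The point of the sketch is only that a program has to do this filtering
explicitly; nothing in an ordinary text tool does it for free.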

I would state that these escape sequences are not useful in a file.  They
could be useful when communicating with a terminal emulator though.  Is there
a termcap/terminfo entry that specifies that the terminal accepts these codes?

> > > The technique that mined98 uses seems to be fairly reliable. In
> > > practice, >98% of all ISO 8859 files contain malformed UTF-8 sequences
> > 
> > 98% isn't very reliable.  I would aim for 99.9% at least.
> 
> I said >98%, not =98%! It is very likely that it works for >99.99% of
> all files. It will certainly detect for >>99.99% of all German
> ISO 8859-1 files that they are obviously not in UTF-8.

Well, why do you say >98% when you really mean >99.9%? :-)

> I challenge you to send me an orthographically correct sentence in one of
> the languages listed in the ISO 8859-1 standard, encoded in Latin-1,
> that does not contain a malformed UTF-8 sequence, i.e. which could not
> trivially be identified as not being UTF-8.

Ah, a challenge!  Well, here's one: OCÉ®  That's the name of the company I
used to work for, with an (R) after it.  Almost any name can be followed by an
(R), so this has quite a big chance of being found in files.  Also, ¹²³
are likely to be used to refer to a footnote, and they too can appear after
many of the characters that look like the start of a UTF-8 sequence.
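
The reason these combinations pass can be seen by looking at the role each
Latin-1 byte would play in UTF-8.  The following is a small Python sketch;
utf8_role is a hypothetical helper written for this message, and the byte
ranges follow the usual UTF-8 rules (0x80-0xBF continuation, 0xC2-0xDF lead
byte of a two-byte sequence, and so on).

```python
def utf8_role(ch: str) -> str:
    """Classify the Latin-1 byte of one character by its role in UTF-8."""
    byte = ch.encode("latin-1")[0]
    if byte < 0x80:
        return "ascii"
    if byte < 0xC0:
        return "continuation"
    if 0xC2 <= byte <= 0xDF:
        return "lead of 2-byte sequence"
    if 0xE0 <= byte <= 0xEF:
        return "lead of 3-byte sequence"
    if 0xF0 <= byte <= 0xF4:
        return "lead of 4-byte sequence"
    return "never valid"  # 0xC0, 0xC1 and 0xF5-0xFF

# É is a two-byte lead; ®, ¹, ² and ³ are all continuation bytes.
# Lowercase é is a three-byte lead, so it needs two continuation bytes.
for ch in "É®¹²³é":
    print(ch, utf8_role(ch))

# The Latin-1 bytes of the example decode as UTF-8 without any error,
# but to a different string than the one that was written.
misread = "OCÉ®".encode("latin-1").decode("utf-8")
print(misread == "OC\u026e")  # True
```

So an uppercase accented letter followed by one of the Latin-1 symbols is
enough to form a well-formed two-byte sequence.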

Need I continue?  Anyway, I have no idea how often these character
combinations occur, but they do exist.  With a character set other than
ISO 8859-1 the chance would be different.  Perhaps there is a specific set
with a high probability?  Perhaps there is some frequently used Polish word
that happens to be a valid UTF-8 sequence.  Hopefully there is an invalid
sequence elsewhere in the same file, so that it can still be detected as not
being UTF-8.

I would say that these sequences do appear, but we don't know how often.
It can still be annoying though.  For example, files I made for Océ would
contain plain text (English or Dutch) and that OCÉ® sequence in the footer of
every page.  If this is recognized as UTF-8 it causes a mess.
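
The footer case can be reproduced with a minimal sketch of a whole-file
heuristic in Python.  This is not the code mined98 uses; guess_and_decode is
a hypothetical function that simply assumes UTF-8 unless the buffer contains
a malformed sequence, and the sample text is invented.

```python
def guess_and_decode(data: bytes):
    """Treat data as UTF-8 unless it contains a malformed sequence."""
    try:
        return "utf-8", data.decode("utf-8")
    except UnicodeDecodeError:
        return "latin-1", data.decode("latin-1")

page = "Plain English text on this page.\n"
footer = "OCÉ®\n"

# Three pages, each ending with the footer, stored as Latin-1.
document = ((page + footer) * 3).encode("latin-1")
encoding, text = guess_and_decode(document)
# Misdetected as UTF-8, and the footer no longer reads as written.
print(encoding, footer in text)

# One lowercase é elsewhere in the file is a malformed UTF-8 sequence here,
# so the same heuristic falls back to Latin-1 and the footer survives.
rescued = document + "Gemaakt voor Océ.\n".encode("latin-1")
encoding2, text2 = guess_and_decode(rescued)
print(encoding2, footer in text2)
```

Whether the file is rescued depends entirely on some other non-ASCII
character being present; an English-only file with this footer has none.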

--
hundred-and-one symptoms of being an internet addict:
81. At social functions you introduce your husband as "my domain server."

--/-/---- Bram Moolenaar ---- Bram@moolenaar.net ---- Bram@vim.org ---\-\--
  \ \    www.vim.org/iccf      www.moolenaar.net       www.vim.org    / /
-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/lists/