Continuation bytes always begin with binary "10". There is no chance for an illegal 5-byte sequence to be mistaken for an illegal 4-byte sequence followed by an ASCII character.

Consider: parser 1 knows that a UTF-8 sequence can have at most 6 bytes, and sees an illegal 5-byte sequence.
Parser 2 knows that a UTF-8 sequence can have at most 4 bytes, and sees an illegal 4-byte sequence followed by an ASCII symbol.
A difference in the interpretation of a byte sequence always has
security implications.
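
To make the disagreement concrete, here is a rough sketch in Python. Everything in it is an assumption made for illustration: the name segment(), the sample bytes F8 80 80 80 41, and above all the error handling, which is not taken from any real decoder. It only splits the input into sequences; it does not check for overlong forms or code point ranges. With resync=False a parser skips as many bytes as it believes the sequence has, so the two parsers split the input differently. With resync=True a parser stops at the first byte that does not begin with binary "10", and both then agree.

```python
def lead_length(byte):
    """Count the leading 1 bits: 0 = ASCII, 1 = continuation byte, 2+ = lead byte."""
    n = 0
    while n < 8 and byte & (0x80 >> n):
        n += 1
    return n


def segment(data, max_len, resync):
    """Split bytes into ("ascii" | "char" | "illegal", raw_bytes) tokens.

    max_len: longest sequence this hypothetical parser believes in (4 or 6).
    resync:  if True, an illegal sequence ends just before the first byte
             that is not a continuation byte; if False, the parser blindly
             skips the full length it expected.
    """
    tokens = []
    i = 0
    while i < len(data):
        announced = lead_length(data[i])
        if announced == 0:
            tokens.append(("ascii", data[i:i + 1]))
            i += 1
            continue
        n = min(announced, max_len)  # assumed: never look further than max_len
        tail = data[i + 1:i + n]
        # 1-based offset of the first tail byte that is not 10xxxxxx, if any
        bad = next((k for k, b in enumerate(tail, 1) if b & 0xC0 != 0x80), None)
        legal = 2 <= announced <= max_len and bad is None and len(tail) == n - 1
        if legal:
            size = n
        elif resync and bad is not None:
            size = bad
        else:
            size = len(tail) + 1
        tokens.append(("char" if legal else "illegal", data[i:i + size]))
        i += size
    return tokens


sample = bytes([0xF8, 0x80, 0x80, 0x80, 0x41])  # 5-byte lead, 3 continuations, "A"
parser1 = segment(sample, max_len=6, resync=False)
parser2 = segment(sample, max_len=4, resync=False)
```

Under these assumptions parser1 reports one illegal sequence that swallows the "A", while parser2 reports an illegal 4-byte sequence followed by an ASCII "A". Any filter built on parser 1 would never see the "A" that a consumer built on parser 2 goes on to process.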
-- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/