UTF-8 (Japanese) strings which look like ASCII strings
A user on one of the sites that I run has managed to create two
user accounts for themselves:
Yoshi
%EF%BC%B9%EF%BD%8F%EF%BD%93%EF%BD%88%EF%BD%89 (UTF-8 using URL encoding)
When rendered in a web browser they both appear as "Yoshi", but from
the point of view of my code and the database they are, of course,
different. I allow people to have unrestricted usernames rather than
restricting them to ASCII-printable-only characters because this makes
sense on a Japanese site.
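
To make the difference concrete, here is a minimal Python sketch (not the
code the site actually runs) that decodes the second username and lists
its code points. The helper name describe() is made up for illustration.

```python
import unicodedata
from urllib.parse import unquote

ascii_name = "Yoshi"
# unquote() decodes percent-escapes as UTF-8 by default.
wide_name = unquote("%EF%BC%B9%EF%BD%8F%EF%BD%93%EF%BD%88%EF%BD%89")

def describe(name):
    """Return (code point, Unicode character name) pairs for each character."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in name]

if __name__ == "__main__":
    for label, name in (("ascii", ascii_name), ("wide", wide_name)):
        print(label, describe(name))
```

The second name is made of characters such as U+FF39 (FULLWIDTH LATIN
CAPITAL LETTER Y), which is why a byte-wise comparison treats the two
accounts as distinct.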
The problem, though, is that this user cannot log in to the non-ASCII
account. They could if I explained at length what has happened and
they understood my explanation, but they shouldn't have to do that
just to use a web site.
Is there a way to solve this? For example, is it feasible to work out
whether a general UTF-8 string has a lossless representation in ASCII
and, if so, do this conversion? [Note that in the second string above,
it looks as if the Japanese part of Unicode contains a second copy of
the Roman character set, so presumably this is not a straightforward
conversion.]
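
One possible approach, sketched here in Python as an assumption rather
than a tested solution for this site, is Unicode compatibility
normalization (NFKC), which folds fullwidth Latin letters onto their
ASCII counterparts. The function names canonical_username() and
ascii_equivalent() are hypothetical.

```python
import unicodedata

def canonical_username(name):
    """Fold compatibility characters (e.g. fullwidth Latin) with NFKC."""
    return unicodedata.normalize("NFKC", name)

def ascii_equivalent(name):
    """Return the NFKC form if it is pure ASCII, otherwise None."""
    folded = canonical_username(name)
    try:
        folded.encode("ascii")
    except UnicodeEncodeError:
        return None
    return folded

if __name__ == "__main__":
    wide = "\uff39\uff4f\uff53\uff48\uff49"  # fullwidth "Yoshi"
    print(ascii_equivalent(wide))
    print(ascii_equivalent("\u3088\u3057"))  # hiragana has no ASCII form
```

Comparing canonical_username() results at both registration and login
would make the two spellings collide. NFKC is not strictly lossless (it
discards the fullwidth/halfwidth distinction), and it does not catch
look-alikes from other scripts, such as Cyrillic letters.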
Alternatively (and I don't really want to do this), is it possible to
have an HTML form which accepts the UTF-8 charset in most fields, but
in which one field is limited to ASCII only?
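
Whatever the form itself does, the restriction would presumably have to
be checked on the server as well, since a browser can submit anything.
A minimal sketch of such a check, assuming the field arrives as an
already-decoded string (is_ascii_printable() is a hypothetical name):

```python
def is_ascii_printable(value):
    """True if value is non-empty and every character is printable ASCII
    (U+0020 space through U+007E tilde)."""
    return bool(value) and all(0x20 <= ord(ch) <= 0x7E for ch in value)

if __name__ == "__main__":
    for candidate in ("Yoshi", "\uff39\uff4f\uff53\uff48\uff49", ""):
        print(repr(candidate), is_ascii_printable(candidate))
```

A form handler could reject the submission, or ask the user to retype
the name, when this returns False for the username field only.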
Is it a good idea to allow unrestricted usernames in any case?
Rich.
--
Richard Jones. http://www.annexia.org/ http://www.j-london.com/
Merjis Ltd. http://www.merjis.com/ - improving website return on investment
MONOLITH is an advanced framework for writing web applications in C, easier
than using Perl & Java, much faster and smaller, reusable widget-based arch,
database-backed, discussion, chat, calendaring:
http://www.annexia.org/freeware/monolith/
--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/