Re: grep is horribly slow in UTF-8 locales
Markus Kuhn wrote:
On Red Hat 9:
$ grep --version
grep (GNU grep) 2.5.1
$ LC_ALL=en_GB.UTF-8 time grep XYZ test.txt
Command exited with non-zero status 1
6.83user 0.07system 0:06.93elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (157major+34minor)pagefaults 0swaps
$ LC_ALL=POSIX time grep XYZ test.txt
Command exited with non-zero status 1
0.07user 0.09system 0:00.16elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (125major+24minor)pagefaults 0swaps
where test.txt is just http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
repeated 10 times.
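For anyone wanting to reproduce the timings, here is a rough sketch of how such a test file could be built. It assumes UnicodeData.txt has already been downloaded from the URL above into the current directory; the helper name repeat_file is made up for illustration and is not from the original post.

```shell
#!/bin/sh
# repeat_file SRC DST N: write N back-to-back copies of SRC into DST.
# (hypothetical helper, only meant to illustrate "repeated 10 times")
repeat_file() {
    src=$1
    dst=$2
    n=$3
    : > "$dst"                  # create or truncate the output file
    i=0
    while [ "$i" -lt "$n" ]; do
        cat "$src" >> "$dst"    # append one more copy of the source
        i=$((i + 1))
    done
}
```

With that defined, `repeat_file UnicodeData.txt test.txt 10` should give a test.txt along the lines of the one described, which can then be fed to the two `time grep XYZ test.txt` runs.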
Wow, I dunno what's going on here. Here are the results on my system
(also Red Hat 9):
$ grep --version
grep (GNU grep) 2.5.1
$ LC_ALL=en_GB.UTF-8 time grep XYZ test.txt
Command exited with non-zero status 1
1.14user 0.04system 0:01.19elapsed 98%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (156major+32minor)pagefaults 0swaps
$ LC_ALL=POSIX time grep XYZ test.txt
Command exited with non-zero status 1
0.01user 0.03system 0:00.03elapsed 102%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (125major+25minor)pagefaults 0swaps
It seems grep performs about 100x worse in a UTF-8 locale than in an
ASCII locale, even where the search string contains no regex
metacharacters.
grep is slower on my system, but it doesn't appear to be as bad as on
your system.
In UTF-8 mode, grep is also much slower than the equivalent Perl:
$ LC_ALL=en_GB.UTF-8 time perl -ne '/XYZ/ && print' test.txt
1.49user 0.05system 0:01.55elapsed 98%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (339major+45minor)pagefaults 0swaps
$ LC_ALL=POSIX time perl -ne '/XYZ/ && print' test.txt
1.17user 0.09system 0:01.28elapsed 98%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (322major+45minor)pagefaults 0swaps
$ LC_ALL=en_GB.UTF-8 time perl -ne '/XYZ/ && print' test.txt
0.30user 0.01system 0:00.33elapsed 92%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (341major+45minor)pagefaults 0swaps
$ LC_ALL=POSIX time perl -ne '/XYZ/ && print' test.txt
0.19user 0.06system 0:00.24elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (325major+44minor)pagefaults 0swaps
Any suggestions? It would be nice not to be penalized like this by grep
for using a UTF-8 locale by default.
Sorry buddy, I have no idea :(
--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/