Sorting UTF-8 wordlists
#1
Hi!

On my Ubuntu VPS, the locale is set to en_US.utf8, but when I run the sort command on a UTF-8 wordlist in a custom language, all special characters like č get converted to c. It looks like a collation issue. What settings do I have to apply for this to work? Do I have to install and switch to a different locale? That would be really bad. I searched Google for a solution, but without success.

Thanks!
#2
What does the sort command you run look like?
#3
(06-12-2012, 01:16 AM)undeath Wrote: What does the sort command you run look like?

It is the standard unix sort.

I run it like this:

cat wordlist.txt | sort -u > sorted.txt
#4
cannot confirm.

Code:
[ undeath@p4home: /tmp ] % ~> cat test
öasdf
đhg4sb5t56
čwegver
Àsdrvgßsd
đhg4sb5t56
è weü46zgbe4z
[ undeath@p4home: /tmp ] % ~> sort -u test
Àsdrvgßsd
čwegver
đhg4sb5t56
è weü46zgbe4z
öasdf
[ undeath@p4home: /tmp ] % ~> echo $LANG,$LC_ALL
de_DE.UTF-8,de_DE.UTF-8
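
For anyone comparing results on another machine: sort takes its collation rules from the locale environment, so one way to rule the locale out is to force plain byte ordering with LC_ALL=C. A minimal sketch follows; the sample words and the file names are made up for illustration and are not from the original wordlist.

```shell
# Hypothetical sample file; the entries are invented for illustration.
printf '%s\n' 'čaj' 'cena' 'dům' 'čaj' 'Zebra' > wordlist.txt

# Print the variables that influence sort (empty if unset).
echo "${LANG:-},${LC_ALL:-}"

# Force byte-wise comparison, independent of any installed locale.
# Lines are dropped by -u only when they are byte-for-byte identical.
LC_ALL=C sort -u wordlist.txt > sorted.txt

cat sorted.txt
```

With byte ordering, uppercase ASCII comes first, then lowercase ASCII, then multi-byte UTF-8 characters, so the result is not in dictionary order, but no character is rewritten and the output is the same on every system.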
#5
Strange, I guess it's all about the locale... I will post again if I run into such problems.
#6
Did you find a solution to this?

Can you extract 10 example lines from your wordlist (ones that contain accents, umlauts, and other non-ASCII UTF-8 characters), run the commands as undeath did, and post the output here?

Then we can test the same on our *nix systems :)
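
One possible way to pull such a sample, as a rough sketch: the file names are placeholders and the sample words are invented, and it assumes a grep that supports -a (GNU and BSD grep do).

```shell
# Hypothetical stand-in for the real wordlist.
printf '%s\n' 'plain' 'čaj' 'cena' 'öl' > wordlist.txt

# In the C locale every byte is one character, so [^ -~] matches any
# line containing a byte outside printable ASCII (this includes tabs).
# -a makes grep treat the file as text even if it looks binary.
LC_ALL=C grep -a '[^ -~]' wordlist.txt | head -n 10 > sample.txt

# Same checks as in undeath's post: sorted output plus the locale variables.
sort -u sample.txt
echo "${LANG:-},${LC_ALL:-}"
```

Posting sample.txt together with the output of the last two commands should be enough for others to reproduce the behaviour.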
#7
please do not revive dead threads.
#8
Just wanted to know the solution and have some discussion around it.

Point noted, thank you.