SHA1 and UTF8
#1
I'm trying to get oclHashCat/cudaHashCat to find some known passwords, but it's not working. Here's an example:

Password: «T€$t»

$ echo -n «T€\$t» | sha1sum
db0822a82e3598764a2836008736987d8555f494

$ echo -n «T€\$t» | hexdump -C
00000000 c2 ab 54 e2 82 ac 24 74 c2 bb |..T...$t..|

So the password is UTF-8 encoded, and as you can see, some characters have 1-byte encodings (T, $, t), some have 2-byte encodings (« and »), and some have 3-byte encodings. The € symbol is encoded with 3 bytes, like this:

$ echo -n € | hexdump -C
00000000 e2 82 ac |...|
00000003

How can I specify the character set necessary to find this with hashcat? Shouldn't take too long, it's only 6 characters long.

The following does not work:

$ cudaHashcat64.bin -m 100 -a 3 --custom-charset1=«»Tt€\$ test.txt ?1?1?1?1?1?1

The following actually works, but if I have to specify all 2- and 3-byte characters as potential hex combinations, I'll be wasting a _lot_ of time trying invalid combinations:

$ cudaHashcat64.bin -m 100 -a 3 --custom-charset1=«»Tt€\$ test.txt ?1?1?1?1?1?1?1?1?1?1

To crack an 8-character password with this method, I'd have to tell hashcat to try all combinations of up to 24 bytes (8 characters × 3 bytes each), which is unlikely to ever finish.
#2
A short back-of-the-envelope calculation:

For argument's sake, let's say my alphabet is a-zA-Z0-9€. If I need to crack an 8-character password that may contain any of these characters encoded in UTF-8, the most straightforward approach seems to be to tell hashcat to try all combinations that are 8 to 24 bytes long, using a "charset" of 65 different bytes (the 62 ASCII characters plus the three bytes of €). That's 65^24 + 65^23 + 65^22 + ... + 65^8, which is completely intractable.

What I really want to do is try all combinations that are 8 characters long, using a charset that contains 63 characters. That is 63^8 and takes a couple of days or so.
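
To put rough numbers on this (my own quick check with bc; the 65^24 figure is just the dominant term of the sum above):

$ echo '63^8' | bc
248155780267521

65^24, by contrast, is a 44-digit number (roughly 3.2 × 10^43), so the byte-level keyspace is on the order of 10^29 times larger than the character-level one.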

To avoid the worst case of 65^24, I guess I have to assume a maximum of just a few multi-byte characters, split their bytes into separate charsets, and create lots of different masks to place them in all possible positions (see the sketch below). But since hashcat doesn't seem to be wide-character aware, this will still be wasting time on invalid code points, especially in the case of 3- and 4-byte characters.
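
For illustration, a hypothetical version of that workaround for a password with exactly one € among 7 one-byte characters: pin each of €'s three bytes (e2 82 ac) to its own hex charset and slide the triple across the positions (flag syntax as I understand oclHashcat's --hex-charset, not tested):

$ cudaHashcat64.bin -m 100 -a 3 --hex-charset -1 e2 -2 82 -3 ac test.txt ?1?2?3?l?l?l?l?l?l?l
$ cudaHashcat64.bin -m 100 -a 3 --hex-charset -1 e2 -2 82 -3 ac test.txt ?l?1?2?3?l?l?l?l?l?l

...and so on for the remaining six positions.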

Or am I missing something? Does hashcat really not support wide characters?
#3
http://www.rurapenthe.me/2013/09/crackin...guage.html
#4
Yes, I had already read it. It's not a good solution, and it doesn't seem the author knows the difference between UTF-8 and Unicode. With UTF-8, a single character is encoded in one, two, three or four bytes. An "a" is encoded just like the ASCII "a", but as I demonstrated, a "€" is encoded with the three-byte sequence "e2 82 ac".

If all UTF-8 characters were encoded with a "base code" byte and a "character code" byte as in the blog you mentioned, a solution of something like ?1?2?1?2?1?2 for a 3-character password could be possible, as suggested. But how are you going to do this if the alphabet contains characters that are one, two or three bytes long, and you want to crack a password of up to 8 characters?
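
To see how fast the number of masks grows (my own back-of-the-envelope figure): if each of the 8 character positions can independently hold a 1-, 2- or 3-byte character, you need a separate mask for every length pattern:

$ echo '3^8' | bc
6561

That's 6561 masks, before even accounting for the invalid byte combinations inside each one.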
#5
(06-01-2016, 12:19 PM)kefir Wrote: it doesn't seem the author knows the difference between UTF-8 and Unicode.

Unicode is a character set and UTF-8 is one of several encodings that can represent it. So "difference between them" is a weird notion.
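
To illustrate: the single code point U+20AC (€) comes out as entirely different bytes depending on the encoding (a quick demonstration with iconv, output in the same abbreviated hexdump style as above):

$ echo -n € | hexdump -C
00000000 e2 82 ac |...|
$ echo -n € | iconv -f UTF-8 -t UTF-16LE | hexdump -C
00000000 ac 20 |. |
$ echo -n € | iconv -f UTF-8 -t CP1252 | hexdump -C
00000000 80 |.|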

(06-01-2016, 12:19 PM)kefir Wrote: If all UTF-8 characters were encoded with a "base code" byte and a "character code" byte as in the blog you mentioned, a solution of something like ?1?2?1?2?1?2 for a 3-character password could be possible, as suggested. But how are you going to do this if the alphabet contains characters that are one, two or three bytes long, and you want to crack a password of up to 8 characters?

On the other hand, what would be a good solution? The hard part here is defining the syntax: you probably wouldn't want e.g. ?s or ?S to represent all specials in Unicode - that would just end up in far too large keyspaces. JtR currently uses the legacy notion of a "codepage", which is fairly easy to understand (if your internal codepage is CP-1252 and you use ?S, it will include "€") but has some limits (you can't crack a password that includes characters from two different codepages). I would love a better solution but can't think of any.
#6
Apologies for any confusion caused, and to the author at rurapenthe.me - I see now that I had somewhat misinterpreted the article. But as far as I can see, it still doesn't have a solution for passwords with characters of variable length. Even for a UTF-8 encoded German password that's mostly made up of a-z (one byte) but may include ß (two bytes), you'd have to run many rules to cover this.

Isn't the only sane approach here to support wide character encodings in hashcat, all the way from character entry through to the kernels on the GPUs? I'm certainly not interested in running with ?s or ?S to try all specials in Unicode (I've not used either of those before). I'm specifying the charset with --custom-charsetX=<chars>.

I'm just surprised that a couple of UTF-8 encoded passwords like abc€ß and €bß, hashed with a single SHA1, seem to cause this much trouble. Or please let me know how to crack the hashes bed208c08c74a2ea4c5b1b29ffd46bd799821326 and c45088d56cf608d841b241dd755c2bee6a684e11 with hashcat. I can do it, but it requires lots of rules, and probably lots of invalid UTF-8 characters. Here's how I generated the hashes, in a plain Ubuntu 14.04 terminal session:

$ echo -n abc€ß | hexdump -C
00000000 61 62 63 e2 82 ac c3 9f |abc.....|
$ echo -n abc€ß | sha1sum
bed208c08c74a2ea4c5b1b29ffd46bd799821326 -

$ echo -n €bß | hexdump -C
00000000 e2 82 ac 62 c3 9f |...b..|
$ echo -n €bß | sha1sum
c45088d56cf608d841b241dd755c2bee6a684e11 -

You mention JtR - I was hoping to have more success with that as my next attempt. I'd be happy to receive any pointers; I'll try to read up on charsets/UTF-8 in JtR today.
#7
The simple solution is to not use a mask attack.
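
(An illustrative command, not from the thread: since wordlist entries are consumed as raw bytes, a UTF-8 dictionary needs no special handling - german-utf8.dict here is a hypothetical stand-in for any UTF-8 encoded wordlist.)

$ cudaHashcat64.bin -m 100 -a 0 test.txt german-utf8.dict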
#8
(06-02-2016, 12:07 PM)kefir Wrote: Isn't the only sane approach here to support wide character encodings in hashcat, all the way from character entry through to the kernels on the GPUs? I'm certainly not interested in running with ?s or ?S to try all specials in Unicode (I've not used either of those before). I'm specifying the charset with --custom-charsetX=<chars>.

That's the gist of it and you have lots of friends, me included. We know what we want but we're not quite sure what we do NOT want.

I've had ideas of making JtR 100% Unicode-aware - meaning, all representation of strings anywhere would be UTF-32 internally. But then it hit me it doesn't solve the MAIN problem at all, which is exactly the problem you raise here: Just HOW should we treat masks and stuff? That problem is not technical! We can't solve it by adding support way inside the inner workings. We need to DECIDE how things should work in practice. And I'm not anywhere near an answer.

(06-02-2016, 12:07 PM)kefir Wrote: I'm just surprised that a couple of UTF-8 encoded passwords like abc€ß and €bß, hashed with a single SHA1, seem to cause this much trouble. Or please let me know how to crack the hashes 

That very problem (and many other IRL problems) is covered by my "internal codepage" strategy. As much as I hate it, I can't see a better solution that doesn't get nasty. In JtR you'd just say "-internal-codepage=cp1252" and everything would be sweet.
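
A sketch of what that looks like on the command line (option names as in recent jumbo builds, written from memory - double-check against your version's --help):

$ john --format=raw-sha1 --encoding=UTF-8 --internal-codepage=CP1252 --mask=?l?l?l?l?l?l?l?s hashes.txt

Mask mode then works in CP-1252 internally and candidates are converted to UTF-8 before hashing, so ?s can include € without the mask engine ever having to deal with multi-byte characters.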

(06-02-2016, 12:07 PM)kefir Wrote: You mention JtR - I was hoping to have more success with that as my next attempt. I'd be happy to receive any pointers; I'll try to read up on charsets/UTF-8 in JtR today.

I spent quite a lot of time trying to add real Unicode support to Hashcat but it ended up in the bin. I think I need to coordinate with Atom and other staff. This is complex shit.

magnum
#9
Regarding brute force: AFAIK hashcat is byte-oriented for performance reasons - this is why it can only handle single-byte chars. Handling multi-byte chars would require completely different inner workings.

I had an idea for a new attack mode that would solve the UTF-8 brute-force problem, but it seems that no one was interested, so I didn't want to pollute the GitHub tracker with the feature request: https://hashcat.net/forum/thread-5188.ht...light=UTF8

Anyway, I do multi-byte brute force by programmatically generating hcmask files and using appropriate hex charset files (e.g. ?l?l?1?2?l?l style masks where ?1 has the interesting first bytes and ?2 the interesting second bytes - for certain languages this works quite well, as there are only 1-2 first bytes). I haven't tried >2 byte chars, though - although if you only want a few, it may be worth generating maskfiles with literal UTF-8 chars, e.g. ß?l?l?l, ?lß?l?l, ... Maskprocessor piped into some sed scripts is very useful for this purpose.
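
To make that concrete, here's roughly what I mean (charsets and file names are just examples). For German, the two-byte characters äöüßÄÖÜ all share the first byte c3, so one hex-charset mask covers a 6-character candidate with one such character in position 3:

$ cudaHashcat64.bin -m 100 -a 3 --hex-charset -1 c3 -2 a4b6bc9f84969c test.txt ?l?l?1?2?l?l

And a quick shell loop (instead of maskprocessor piped into sed) that writes a maskfile placing a literal ß at each position of an 8-character candidate:

$ for i in 0 1 2 3 4 5 6 7; do
>   mask=
>   for j in 0 1 2 3 4 5 6 7; do
>     if [ "$j" = "$i" ]; then mask="${mask}ß"; else mask="${mask}?l"; fi
>   done
>   echo "$mask"
> done > ss.hcmask
$ cudaHashcat64.bin -m 100 -a 3 test.txt ss.hcmask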
#10
While thinking a bit more on it, I think that's also the way to go if we were to add real multi-byte support to hashcat. Otherwise you'd end up with too many branches. We also need to keep the same length for all elements in the same vector datatype.