Masks for Multiple Language Charsets in UTF-8
#1
Hi all.

I've read pretty much everything I can find on the subject of masks and charsets, but can't find or work out a solution for this issue. For the record, the resource I followed most closely was: http://www.netmux.com/blog/ultimate-guid...-using-has, in concert with the FAQ and Wiki entries on custom character sets and masks.

I am trying to adapt the rockyou masks to support both the Russian and Basic Latin (English) character sets within the same password strings. The hashes were originally created on a system with UTF-8 encoding. From my understanding, the best (only?) way to create UTF-8 representation is to use --hex-charset, with -1 being the first byte range and -2 being the second byte range. For the record, I'm able to crack a password which uses ONLY the Russian language.

I've tried creating masks where ?1/2/3/4 are the literal characters, but they failed to crack any known passwords. (The cracking was done on an Ubuntu system with hashcat 4.x with UTF-8 as the locale/environment.) I've also tried cracking hashes of known Russian-only passwords, created on a UTF-8 system, by using the built-in Russian character sets, and that fails too. (1 byte vs 2 bytes, I'm assuming.)

Here is a mask which successfully cracks a 3 character (6 byte) Russian password when used with --hex-charset:
Code:
d0,808182838485868788898a8b8c8d8e8f909192939495969798999a9b9c9d9e9fa0a1a2a3a4a5a6a7a8a9aaabacadaeaf,d0d1,b0b1b2b3b4b5b6b7b8b9babbbcbdbebf808182838485868788898a8b8c8d8e8f909192939495969798999a9b9c9d9e9f,?1?2?3?4?3?4
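For anyone who wants to rebuild charsets like these instead of typing the hex by hand, here is a minimal Python sketch. It assumes the basic Russian alphabet only (U+0410 to U+044F); the helper name utf8_byte_columns is made up for illustration. It leaves out Ё/ё (d0 81 and d1 91), which the wider ranges in the mask above do cover, so the output is narrower than that mask.

```python
def utf8_byte_columns(chars):
    """Return (lead_hex, cont_hex): the distinct first and second UTF-8
    bytes used by a collection of 2-byte characters, as hex strings."""
    leads, conts = set(), set()
    for ch in chars:
        encoded = ch.encode("utf-8")
        if len(encoded) != 2:
            raise ValueError(f"{ch!r} is not a 2-byte UTF-8 character")
        leads.add(encoded[0])
        conts.add(encoded[1])

    def to_hex(values):
        return "".join(f"{b:02x}" for b in sorted(values))

    return to_hex(leads), to_hex(conts)


# Assumption: basic Russian alphabet only, without Ё/ё.
upper = [chr(cp) for cp in range(0x0410, 0x0430)]  # uppercase block
lower = [chr(cp) for cp in range(0x0430, 0x0450)]  # lowercase block

upper_lead, upper_cont = utf8_byte_columns(upper)
lower_lead, lower_cont = utf8_byte_columns(lower)

# One .hcmask-style line: ?1/?2 = uppercase bytes, ?3/?4 = lowercase bytes,
# followed by a mask for one uppercase and two lowercase characters.
hcmask_line = f"{upper_lead},{upper_cont},{lower_lead},{lower_cont},?1?2?3?4?3?4"
print(hcmask_line)
```

Keep in mind that a lead-byte charset and a continuation-byte charset are combined as a cross product, so a pair like ?3?4 also produces byte pairs outside the lowercase alphabet (d0 80, for example). That costs keyspace but does not prevent a correct password from being found.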

The issue I'm encountering is that it appears that the Basic Latin character set in UTF-8 is encoded with only one byte. Therefore, a 2 bytes per character mask will not work. I used the same password cracked with the above mask, appended a Latin 's' (lower case s) to it, and updated the mask line to the following, hoping that addressing a Latin character with \x00\x## would work. It does not. It appears that for whatever reason in the combination of hashcat, hash environment, crack environment, and encoding specs, that "s" in UTF-8 is just \x73, not \x00\x73.

Code:
00d0,808182838485868788898a8b8c8d8e8f909192939495969798999a9b9c9d9e9fa0a1a2a3a4a5a6a7a8a9aaabacadaeaf4142434445464748494a4b4c4d4e4f505152535455565758595a,00d0d1,b0b1b2b3b4b5b6b7b8b9babbbcbdbebf808182838485868788898a8b8c8d8e8f909192939495969798999a9b9c9d9e9f6162636465666768696a6b6c6d6e6f707172737475767778797a,?1?2?3?4?3?4?3?4

(If need be, I can break down what is what within the mask. But I'm assuming anyone who knows enough to help answer the question also knows enough about character sets to be able to parse it for themselves if needed.)

And obviously, that Latin character could be anywhere inside the password, not just at the end, so the specific mask isn't the important part. 

So, I guess my most direct question is: is this possible? Is it possible to set up a mask with a variable-length, optional, or dependent component? For example, using the mixed-language hex charset, is there a way to tell it to ignore the first ?1 if the next character in the mask will be ?2 between \x41 and \x5a? Or even a simple way of saying "some of these are one byte, and some are two bytes"? Or some other workaround? Also, if I'm entirely barking up the wrong tree with a core assumption here, please let me know.

Any other thoughts on what I'm missing, or something else I should try?

Thanks in advance.
A guy named Lou.

(So, it looks like I rambled a bit. Please feel free to ask if you want clarification on anything.)

EDIT to Add - No, my test didn't work.
#2
Your findings are correct. UTF-8 is fully ASCII-compatible, and Latin characters (along with digits and the basic set of special characters) are represented with only one byte.
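A quick way to see this for yourself is to print the UTF-8 bytes of a few characters. This is only an illustrative Python sketch; the sample characters and the helper name utf8_hex are arbitrary choices, not anything from hashcat.

```python
def utf8_hex(text):
    """Hex string of the UTF-8 encoding of text."""
    return text.encode("utf-8").hex()


samples = [
    "s",       # Latin small s
    "7",       # ASCII digit
    "!",       # ASCII punctuation
    "\u0441",  # Cyrillic small es (looks like a Latin c)
    "\u042f",  # Cyrillic capital ya
]

for ch in samples:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X}: {len(encoded)} byte(s) -> {encoded.hex()}")

# Three Cyrillic letters plus one Latin letter: 3 * 2 + 1 = 7 bytes,
# so a mask built purely from 2-byte slots cannot match it.
mixed = "\u043f\u0430\u0440" + "s"
print(len(mixed), "characters,", len(mixed.encode("utf-8")), "bytes")
```

The Latin "s" comes out as the single byte 73, with no leading 00, which is why the \x00\x73 attempt above cannot match.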

As you already noticed, hashcat is oblivious to character encodings (except for --encoding-from/--encoding-to), and thus the issue of multibyte encodings is an open problem.

Of course you can construct masks that assume certain characters are two bytes while others are one, but you'll need a single mask for each possibility.
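To give a feel for what "a single mask for each possibility" means in practice, here is a rough Python sketch that writes out every layout for a fixed number of characters. The charset assignment is an assumption for illustration: ?1 = Cyrillic lead bytes, ?2 = all continuation bytes 80 to bf, ?3 = Latin lowercase given as hex. Each output line follows the .hcmask layout (custom charsets, then the mask, comma-separated) and is meant to be used with --hex-charset.

```python
from itertools import product

# Hypothetical charset assignment, all values in hex for --hex-charset.
CYR_LEAD = "d0d1"                                             # ?1
CYR_CONT = "".join(f"{b:02x}" for b in range(0x80, 0xC0))     # ?2
LATIN_LOWER = "".join(f"{b:02x}" for b in range(0x61, 0x7B))  # ?3 (a-z)


def mixed_masks(length):
    """Yield one .hcmask line per way of choosing, for each of `length`
    characters, a 2-byte Cyrillic slot (?1?2) or a 1-byte Latin slot (?3)."""
    for layout in product(("?1?2", "?3"), repeat=length):
        yield f"{CYR_LEAD},{CYR_CONT},{LATIN_LOWER}," + "".join(layout)


masks = list(mixed_masks(4))
print(len(masks), "masks for 4 characters")
for line in masks:
    print(line)
```

For 4 characters this gives 16 lines, one of which ends in ?1?2?1?2?1?2?3 (three Cyrillic characters followed by one Latin letter). The count doubles with every extra character, and the ?1?2 pair also covers byte combinations that are not Russian letters, so the keyspace is larger than strictly necessary.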
#3
(07-11-2018, 05:46 PM)undeath Wrote: Your findings are correct. UTF-8 is fully ASCII-compatible, and Latin characters (along with digits and the basic set of special characters) are represented with only one byte.

As you already noticed, hashcat is oblivious to character encodings (except for --encoding-from/--encoding-to), and thus the issue of multibyte encodings is an open problem.

Of course you can construct masks that assume certain characters are two bytes while others are one, but you'll need a single mask for each possibility.

OK, thank you Undeath. Glad to know I wasn't missing something super obvious or misunderstanding how it all works. The creation of individual masks for each possible combination of 1 and 2 byte characters is... not appealing.

Future feature request? ;)
#4
You can do mask attack with UTF-8 like this:

Quote:
root@ht:~/hashcat# echo -n ä | xxd
00000000: c3a4                                     ..
root@ht:~/hashcat# echo -n ä | iconv -f utf8 -t iso-8859-1 > umlaut
root@ht:~/hashcat# echo -n hällo | md5sum
38bbce5fb3d0aec801f9eab39182fa7a  -
root@ht:~/hashcat# ./hashcat -a 3 --stdout -1 umlaut -2 ?1?l ?2?2?2?2?2 | ./hashcat 38bbce5fb3d0aec801f9eab39182fa7a --quiet --encoding-from iso-8859-1 --encoding-to utf8    
38bbce5fb3d0aec801f9eab39182fa7a:hällo
root@ht:~/hashcat#
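The reason this works is that the candidates are generated in a single-byte encoding, where every character is exactly one mask position, and only converted to UTF-8 afterwards. Below is a small Python sketch of that re-encoding step. The cp1251 part is my own assumption about how the same idea could carry over to mixed Russian/Latin passwords; it is not something shown in the transcript above, and the function name reencode is made up.

```python
def reencode(candidate: bytes, single_byte_codec: str) -> bytes:
    """Interpret candidate bytes in a single-byte codec and return the
    same text encoded as UTF-8."""
    return candidate.decode(single_byte_codec).encode("utf-8")


# The example from the transcript: 5 mask positions, 6 bytes after conversion.
latin1_candidate = b"h\xe4llo"  # "hällo" in ISO-8859-1
latin1_as_utf8 = reencode(latin1_candidate, "iso-8859-1")
print(len(latin1_candidate), "->", len(latin1_as_utf8), latin1_as_utf8.hex())

# Assumption: cp1251 as the single-byte encoding for Russian. It keeps ASCII
# unchanged, so Cyrillic and Latin letters can sit in the same charset.
cp1251_candidate = ("\u043f\u0430\u0440" + "s").encode("cp1251")
cp1251_as_utf8 = reencode(cp1251_candidate, "cp1251")
print(cp1251_candidate.hex(), "->", cp1251_as_utf8.hex())
```

In the cp1251 case the four one-byte positions (ef e0 f0 73) turn into seven UTF-8 bytes, with the Latin "s" staying a single byte. Whether your build accepts cp1251 as an --encoding-from name depends on the encoding names its iconv knows, so treat that part as untested.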