Even theoretically possible, to bruteforce passwords that consists of a character map with 2^16 chars (that is UCS-2, UTF-16 can take more than two bytes per code-point as it is not fixed-length) is extending the time extremely. So UTF-16 must be expanded to UTF-32 first when you want to deal with this strictly. I have no proof for this, but I just assume that the trade-off for a hashing algo would be too immense to deal with the actual character encoding as well instead of just crunching bytes in memory at the expected, non-variant position. So if you say UTF-16 we might need to talk about UCS-2 and/or UTF-32 memory wise. Let's imagine what this means compared to normal ASCII or LATIN-1 passwords. For a password that consists of four characters (only, probably in Chinese this is much, but I have no idea). A simplified view could be:
You might now think that's possible to bruteforce a 8 character ASCII password. But keep in mind that while bruteforcing, you only take some character, like 0-9, a-z and A-Z (62 possibilities). For a multi-byte charset you most likely want to look for more chars per code-point (most likely more than 3 844 (62^2) possibilities per code point in a 16-Bit charset).
I don't think that's something someone is really after as it does not scale well with bruteforcing. I can imagine cases where this makes sense though, but most likely, this is "too much" for bruteforce.
But the number of possible combinations which plays a if not the foremost role in bruteforcing is not everything to consider for character encoding while cracking passwords. Hashcat supports dictionary and rules as well, so this is not bruteforcing only.
Dictionary Attacks
For dictionary attacks, this would mean the cracking app needs to support multi-byte charset dictionaries as well. This is because it makes no sense to crack multi-byte char hashes with a single-byte char dictionary, because the dictionary consists only of a fraction of words.
The only exception I see here is for UTF-8 but that's only because UTF-8 has ASCII-7bit as subset. And it's the only one I know of so far.
Rule Files and Interpreter
And then the rules. The current rule engine is implemented to work on single byte characters. The rule interpreter as well must support multi-byte charsets to properly work. That would need one additional version for two- and one for four-byte charsets.
Probably the rule language must be even adopted to support such charsets.
Attack Strategy: Single-byte encoded as Multi-byte
As atom already proposed, there is an exception to all this. While some systems store hashes for passwords that are multi-byte charset encoded, most users enter the password via their western keyboards so most likely the passwords consist of characters that can be easily represented with a single byte character set like ASCII or LATIN-1.
So to attack those even multi-byte encoded passwords, one option is to convert the multi-byte charsets into single-byte charsets first and then single-byte attack them while maintaining the byte/charset relationship in the hashing algorithm. To maintain the relationship is necessary otherwise the algorithm won't find the hashes properly. So this method has an overhead but the overhead is most likely smaller then having everything (hashes, hash-algos, dictionaries and rules) in a multi-byte variant.
I assume that's what atom already was talking about. That means a specific mode not only for the algorithm but also the variant of it for a specific charset. I think this method is efficient but it has the trade-off not being able to crack multi-byte character passwords that consist of characters not covered by single-byte charsets / dictionaries.
My 2 cents.
-- hakre
Code:
4 character UCS-2 password = 8 character ASCII/LATIN-1 password
4 character UTF-32 password = 16 character ASCII/LATIN-1 password
You might now think that's possible to bruteforce a 8 character ASCII password. But keep in mind that while bruteforcing, you only take some character, like 0-9, a-z and A-Z (62 possibilities). For a multi-byte charset you most likely want to look for more chars per code-point (most likely more than 3 844 (62^2) possibilities per code point in a 16-Bit charset).
I don't think that's something someone is really after as it does not scale well with bruteforcing. I can imagine cases where this makes sense though, but most likely, this is "too much" for bruteforce.
But the number of possible combinations which plays a if not the foremost role in bruteforcing is not everything to consider for character encoding while cracking passwords. Hashcat supports dictionary and rules as well, so this is not bruteforcing only.
Dictionary Attacks
For dictionary attacks, this would mean the cracking app needs to support multi-byte charset dictionaries as well. This is because it makes no sense to crack multi-byte char hashes with a single-byte char dictionary, because the dictionary consists only of a fraction of words.
The only exception I see here is for UTF-8 but that's only because UTF-8 has ASCII-7bit as subset. And it's the only one I know of so far.
Rule Files and Interpreter
And then the rules. The current rule engine is implemented to work on single byte characters. The rule interpreter as well must support multi-byte charsets to properly work. That would need one additional version for two- and one for four-byte charsets.
Probably the rule language must be even adopted to support such charsets.
Attack Strategy: Single-byte encoded as Multi-byte
As atom already proposed, there is an exception to all this. While some systems store hashes for passwords that are multi-byte charset encoded, most users enter the password via their western keyboards so most likely the passwords consist of characters that can be easily represented with a single byte character set like ASCII or LATIN-1.
So to attack those even multi-byte encoded passwords, one option is to convert the multi-byte charsets into single-byte charsets first and then single-byte attack them while maintaining the byte/charset relationship in the hashing algorithm. To maintain the relationship is necessary otherwise the algorithm won't find the hashes properly. So this method has an overhead but the overhead is most likely smaller then having everything (hashes, hash-algos, dictionaries and rules) in a multi-byte variant.
I assume that's what atom already was talking about. That means a specific mode not only for the algorithm but also the variant of it for a specific charset. I think this method is efficient but it has the trade-off not being able to crack multi-byte character passwords that consist of characters not covered by single-byte charsets / dictionaries.
My 2 cents.
-- hakre