UTF-8 dictionaries (hex format)

Source: hashcat Forum, Misc, General Talk (https://hashcat.net/forum/thread-6904.html)

UTF-8 dictionaries (hex format) - devilsadvocate - 09-28-2017

Would anyone know of a good source for hex formatted, non-English UTF-8 dictionaries for use with the --hex-wordlist option that hashcat provides?

Short of that, has anyone come up with rules for substituting single-byte characters with two-byte characters that some foreign languages use? German would be one example.

Such dictionaries could be built from existing ones that use only single-byte characters.

Take an existing single-byte foreign-language dictionary, convert it to hex, and replace the applicable hex codes with their UTF-8 equivalents. The result is a more effective non-English dictionary.

For example, the German letter Ü is C3 9C in hex when encoded as UTF-8.

In this example, a script would search for hex codes 75 and 55 (u and U respectively) and replace them with C3BC and C39C (respectively), so that each single-byte hex code becomes its corresponding two-byte UTF-8 representation.

u = 75
U = 55
becomes
ü = C3BC
Ü = C39C
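
The conversion described above could be sketched roughly as follows. This is a minimal illustration, not a tested tool: the names `SUBSTITUTIONS`, `to_utf8_hex`, and `convert_lines` are hypothetical, and it assumes the input words are plain ASCII. It substitutes on characters first and hex-encodes afterwards, which avoids accidentally matching "75" across a byte boundary in a hex string.

```python
# Hypothetical helper names; none of this is part of hashcat or hashcat-utils.
# Assumption: input words are plain ASCII, one word per line.
SUBSTITUTIONS = {"u": "ü", "U": "Ü"}


def to_utf8_hex(word, substitutions=SUBSTITUTIONS):
    """Replace mapped letters, then return the UTF-8 bytes as uppercase hex."""
    replaced = word.translate(str.maketrans(substitutions))
    return replaced.encode("utf-8").hex().upper()


def convert_lines(lines, substitutions=SUBSTITUTIONS):
    """Yield one hex-encoded word per non-empty input line."""
    for line in lines:
        word = line.rstrip("\r\n")
        if word:
            yield to_utf8_hex(word, substitutions)
```

Note that this replaces every u in a word, exactly as the search-and-replace idea is stated; emitting both the original and the substituted form of each word would be a separate design choice.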

This strays from the topic of this post, but something similar could be built to substitute ASCII characters with their single-byte Latin-1 equivalents. That would help with non-English dictionaries whose words are written using only ASCII characters.
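
A rough sketch of the Latin-1 variant, under the same assumptions (ASCII input, a hand-written substitution table). The function name `to_latin1_hex` and the sample mapping are hypothetical. In Latin-1, ü and Ü each occupy a single byte (FC and DC), so the output stays one byte per character.

```python
# Hypothetical helper; the substitution table is an illustrative assumption.
LATIN1_SUBSTITUTIONS = {"u": "ü", "U": "Ü"}


def to_latin1_hex(word, substitutions=LATIN1_SUBSTITUTIONS):
    """Replace mapped letters, then return the Latin-1 bytes as uppercase hex."""
    replaced = word.translate(str.maketrans(substitutions))
    # Raises UnicodeEncodeError if a character has no Latin-1 representation.
    return replaced.encode("latin-1").hex().upper()
```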

There are a lot of non-English dictionaries publicly available, but so many of them are using only ASCII characters for their words. I see this the most with Spanish dictionaries.

Footnote on this character, Ü:
https://en.wikipedia.org/wiki/%C3%9C

This character is actually used in several languages, not just German. Quoting the Wikipedia page: "Hungarian, Turkish, Uyghur Latin, Estonian, Azeri, Turkmen, Crimean Tatar, Kazakh Latin and Tatar Latin alphabets"

RE: UTF-8 dictionaries (hex format) - devilsadvocate - 09-29-2017

Seek and you shall find.

https://github.com/wooorm/dictionaries

https://github.com/titoBouzout/Dictionaries

Now I need to figure out how to apply rules to multi-byte UTF-8 characters. *logs off to research more*

To strip the slash, and everything that follows it on the same line, from the entries in these .dic files, you can run this command.

find . -name \*.dic -exec sed -i 's-/.*--g' {} \;
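
To show what that sed expression does to a single entry, here is a small Python equivalent. The function name `strip_affix_flags` and the sample entries in the usage lines are made up for illustration; the behaviour mirrors `s-/.*--g`, which deletes the first slash on a line and everything after it.

```python
import re


def strip_affix_flags(line):
    """Drop the first '/' and everything after it on the line, like the sed command."""
    # '.' does not match a newline, so a trailing line ending is preserved.
    return re.sub(r"/.*", "", line)


# Hypothetical sample entries, not taken from any real dictionary file.
for entry in ["Haus/EPST\n", "und\n"]:
    print(strip_affix_flags(entry), end="")
```

Lines without a slash pass through unchanged.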

RE: UTF-8 dictionaries (hex format) - atom - 10-01-2017

When it comes to non-Latin wordlists, I'd recommend using req-include from hashcat-utils to move those words into a separate wordlist, converting that wordlist to UTF-8, and then using it in combination with the --encoding-to option (available since v3.6.0), set to the destination encoding you want to use.
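
The "convert it to utf-8" step could look something like this in Python. This is only a sketch: the function name `reencode_to_utf8` is hypothetical, and the source encoding has to be known in advance, since it cannot be reliably detected from the bytes alone.

```python
def reencode_to_utf8(data, source_encoding):
    """Decode raw wordlist bytes from source_encoding and re-encode them as UTF-8."""
    # Assumption: the caller knows the wordlist's real encoding (e.g. "cp1251").
    # A wrong guess either raises UnicodeDecodeError or silently yields wrong text.
    return data.decode(source_encoding).encode("utf-8")


# Example: the Latin-1 bytes for "für" become the UTF-8 bytes 66 C3 BC 72.
print(reencode_to_utf8(b"f\xfcr", "latin-1").hex())
```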