Idea: Wordlist-creation from language-statistics
#1
Hi there,

I want to share an idea that I'm currently scripting. I'm looking for people who want to give some input to improve it.

My problem
I can't find a good German wordlist for hash cracking. All the wordlists I found are bad in some way (strange or unrealistic words, words that are not long enough...).
But brute-forcing human-selected passwords is really challenging because of fantasy words, words from personal context, and so on.

My Idea
If I analyze some megabytes of text in a language (say, Wikipedia) and build a statistic of how likely one character is to follow another, then I can create strings (I don't want to say "words") with a defined overall likelihood. By increasing this overall likelihood, I can generate strings that look like German words.
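
A minimal sketch of this idea, not the script described in this post: it counts how often each character follows another and scores a string by the sum of its negative log2 transition probabilities, so a lower cost means a more language-like string. The names `train_bigrams` and `cost`, the `^` start-of-word marker, the add-one smoothing and the tiny sample text are all assumptions made for illustration; the numbers it prints are not on the same scale as the likelihood values in this thread.

```python
import math
from collections import Counter, defaultdict

ALPHABET = "abcdefghijklmnopqrstuvwxyzäöüß"  # assumed charset
START = "^"  # hypothetical start-of-word marker


def train_bigrams(text, alphabet=ALPHABET):
    """Count how often each character follows another (including word starts)."""
    counts = defaultdict(Counter)
    for word in text.lower().split():
        if any(ch not in alphabet for ch in word):
            continue  # skip tokens containing digits, punctuation, ...
        prev = START
        for ch in word:
            counts[prev][ch] += 1
            prev = ch
    return counts


def cost(word, counts, alphabet=ALPHABET):
    """Sum of -log2 P(next char | previous char); lower = more language-like."""
    total = 0.0
    prev = START
    for ch in word:
        row = counts.get(prev, {})
        # Add-one smoothing (an assumption) keeps unseen pairs at a finite cost.
        p = (row.get(ch, 0) + 1) / (sum(row.values()) + len(alphabet))
        total += -math.log2(p)
        prev = ch
    return total


# Tiny stand-in for "some megabytes of text".
SAMPLE = "der hund stand an der strasse und die stunden enden im stehen"
counts = train_bigrams(SAMPLE)
for w in ("stenden", "xqzjkv"):
    print(w, round(cost(w, counts), 1))
```

With a real corpus, the only change would be to pass the full text to `train_bigrams`; the made-up string "stenden" should then still come out cheaper than a random string such as "xqzjkv".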

Current state
In my tests, the most common German 8-char string is 'stendend'. Of course, this is not a German word, but it looks very "German" and it's pronounceable. By increasing the maximum allowed likelihood for the generator, it generates a lot of words.

The 8-char strings with an overall likelihood of 1 are:
stendeng, stendere, stendind, stengend, sterende, stindend, schenden, andenden
-> a lot of nonsense here!

Some examples of calculated likelihoods of words:
hashcat 56
firefox 64
kölnerdom 61
langestrasse 29
suppentopf 49
bierfass 35
schnapps 41
ollesding 53
hundedreck 42

So in the likelihood range of 20-70, there are a great many realistic German words that are high-potential passwords. But the lists in this range are several gigabytes in size.
A list of all words with a length of 6-8 chars and a likelihood of 0-59 is about 60 GB. Combined with some hashcat rules (capitalize, append numbers...), that is a lot of work for hashcat. If you go further with likelihood and word length, the list size of course increases drastically. After a full generation of the wordlist, I get a full brute-force list with all possible combinations (charsetsize^8), but this list is ordered by something like a hit chance.
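
One way such a bounded generator could look, as a rough sketch rather than the actual script: a depth-first walk over all strings of a fixed length that drops a prefix as soon as its accumulated cost exceeds the bound. `generate`, `toy_step_cost` and the three-letter toy model are hypothetical; a real run would plug in per-step costs derived from corpus statistics, with lower cost meaning more likely.

```python
def generate(length, max_cost, step_cost, alphabet, start="^"):
    """Yield (cost, string) for every string of `length` whose summed
    per-step cost is at most `max_cost` (depth-first search with pruning)."""
    stack = [("", start, 0.0)]
    while stack:
        prefix, prev, total = stack.pop()
        if len(prefix) == length:
            yield total, prefix
            continue
        for ch in alphabet:
            new_total = total + step_cost(prev, ch)
            # Step costs are non-negative, so a prefix that is already over
            # the bound can never lead to a string within the bound.
            if new_total <= max_cost:
                stack.append((prefix + ch, ch, new_total))


# Toy model (made up for illustration): frequent pairs cost 1, all others 3.
CHEAP = {("^", "a"), ("a", "b"), ("b", "a")}


def toy_step_cost(prev, ch):
    return 1.0 if (prev, ch) in CHEAP else 3.0


# Sorting by cost gives the "ordered by hit chance" list for this bound.
for total, s in sorted(generate(3, 5.0, toy_step_cost, "abc")):
    print(total, s)
```

Raising `max_cost` step by step and emitting only the strings that are new at each step would produce the full keyspace in roughly descending order of plausibility, without holding the whole list in memory at once.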

State of the code
I have an ugly Python script which gets the job done. It parses an input text file to build the statistics and generates words with a defined length and likelihood. It's about 150 LOC. Of course, I'm willing to share it, but it's too ugly at the moment. Right now, it is nothing more than an idea which might be good.

My questions
- What do you think about this idea?
- Are there ideas for more optimizations or other approaches?
- Are there people here who have done some experiments in the same context?
- Any feedback is welcome!
- Or, simply: do you have a good German wordlist?

So, good hunting!
PyDreamer
#2
If I'm not totally mistaken, you are basically describing how Markov chains could help, and yeah:

hashcat has built-in support for Markov chains and it's enabled by default (see --help and hcstat2gen from hashcat-utils).
#3
Hi Phil,
thanks for the hint! I was not aware of this. I will read about it!
Thank you!
#4
There is only one disadvantage: the Markov chain support is limited to the mask-based attack types (-a 3, and the mask part of -a 6 and -a 7).

So if you run a wordlist-based attack (-a 0 or -a 1), hashcat will still use the order of lines basically unchanged (well, the order could get slightly mixed because of parallelization across different OpenCL/CUDA devices). The same holds for the *words* in -a 6 and -a 7; only the mask part of those attacks uses Markov.

That means an external tool can sometimes make sense if you want to change the order of words in a dictionary file to prioritize some words (pre-computing a "slightly" modified wordlist where only the order of lines is changed).
#5
Hi pydreamer,

you might want to have a look at this:
https://github.com/RUB-SysSec/OMEN