Wordlist optimisation based on ruleset
#1
Hello all,

I'm currently sitting on about 400 GB of unique lines' worth of wordlists. I pretty much downloaded everything I could find, then sorted and deduped it, and I am finding passwords. I'm pretty much only using the lists with the dive and best64 rulesets, but that is most definitely not the most effective/fastest way of finding passwords. The main problem I see is that the list is filled with different versions of the same base word.

Example:

Code:
welkom
welkom01
welkom01!
Welkom

This is what rulesets are for, in my opinion. But I was wondering if anyone has made a reverse ruleset engine: you feed it a wordlist and a ruleset, and it spits out only the base words that are required, plus the rules, to arrive at each password. I was thinking of building something like this myself, but I only know Python, and perhaps something like this already exists.
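
Roughly what I have in mind, as a toy sketch: it only handles appended digits/specials and simple first-letter capitalisation, nothing like a full rule engine.

Code:
import string

def split_password(password):
    """Guess a (base_word, hashcat_rule) pair that rebuilds the password."""
    base = password.rstrip(string.digits + string.punctuation)
    suffix = password[len(base):]
    ops = []
    if base != base.lower():
        ops.append("c")                    # 'c' capitalises the first letter (rough guess)
    ops += ["$" + ch for ch in suffix]     # '$X' appends the character X
    return base.lower(), " ".join(ops) if ops else ":"   # ':' is the no-op rule

for pw in ["welkom", "welkom01", "welkom01!", "Welkom"]:
    print(split_password(pw))
# ('welkom', ':')  ('welkom', '$0 $1')  ('welkom', '$0 $1 $!')  ('welkom', 'c')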
#2
There has been some work in this space, but it's a challenging problem. And it's highly idiosyncratic in that the usefulness of any given wordlist and ruleset depends heavily on the nature of the passwords being attacked.

Generally, getting base words from passwords is sometimes called 'stemming' (a term borrowed from linguists). Some basic stemming can be done with rurapenthe's rurasort (https://github.com/bitcrackcyber/rurasort). IIRC Matt Weir (lakiw) has done work in this space as well.

Also, testing whether a given ruleset works well for a given wordlist and a given hashlist is an art in itself, and again depends heavily on the source and target material. hashcat's --debug-mode / --debug-file and --outfile* parameters are useful for this.
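
For example (the hash mode and file names here are just placeholders), a run like this records which base word and rule produced each crack:

Code:
hashcat -m 0 -a 0 hashes.txt wordlist.txt -r rules/best64.rule \
    --debug-mode=4 --debug-file=matched.debug -o found.txt

With --debug-mode=4, each crack is logged as the original word, the rule that matched, and the resulting plain, which makes it easy to see which rules and base words actually earn their keep.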

In general, you've got a much larger accumulation of base passwords than is likely to be efficient for most attack types, so your instincts are good. But before trying to get the base words from 400GB of raw wordlists, I'd start with a smaller corpus (like the hashes.org founds) and build up from there. For general password targets, that is more likely to be more time-efficient.
~
#3
Thank you for your reply, rurasort looks interesting.


I'm still thinking about how I would build this without making it painfully slow. It might be really complicated and not worth it.

EDIT:
Ouch, it's only reading at 2-3 MB/s while hitting 100% on a single core. I'll see if I can't improve on this; if I do, I'll post it here.

EDIT2:

I'm really bad at keeping up with forum posts, so I'm just going to post the rough piece of code.

rurasort.py --digit-trim --special-trim --lower: 34.844 s for 10M lines
The code below, doing the same thing: 2.503 s for 10M lines

Code:
import multiprocessing as mp
import os

path = "/mnt/NVMe/wordlist_10M.txt"
cores = 8

DIGITS = "0123456789"
SPECIALS = "!\"#$%&'()*+,-./:;?@[\\]^_`{|}~"

def process(line):
    # trim leading/trailing digits, then leading/trailing specials, then lowercase
    newstring = line.strip(DIGITS)
    newstring = newstring.strip(SPECIALS)
    print(newstring.lower())

def process_wrapper(chunk_start, chunk_size):
    # each worker re-opens the file and handles its own byte range
    with open(path, "rb") as f:
        f.seek(chunk_start)
        lines = f.read(chunk_size).splitlines()
        for line in lines:
            process(line.decode("utf-8", errors="ignore"))

def chunkify(fname, size=1024 * 1024):
    # yield (start, length) byte ranges that end on line boundaries
    file_end = os.path.getsize(fname)
    with open(fname, "rb") as f:
        chunk_end = f.tell()
        while True:
            chunk_start = chunk_end
            f.seek(size, 1)   # jump roughly one chunk ahead...
            f.readline()      # ...then advance to the next newline boundary
            chunk_end = f.tell()
            yield chunk_start, chunk_end - chunk_start
            if chunk_end >= file_end:
                break

if __name__ == "__main__":
    # init objects
    pool = mp.Pool(cores)
    jobs = []

    # create one job per chunk
    for chunk_start, chunk_size in chunkify(path):
        jobs.append(pool.apply_async(process_wrapper, (chunk_start, chunk_size)))

    # wait for all jobs to finish (output order is not preserved)
    for job in jobs:
        job.get()

    # clean up
    pool.close()
    pool.join()
#4
(11-06-2019, 01:37 AM)eddie4 Wrote: Ouch, it's only reading at 2-3 MB/s while hitting 100% on a single core. I'll see if I can't improve on this; if I do, I'll post it here.

Thanks for posting this. After reading it, I pushed some similar changes back to rurasort, including some compile-once optimisations for hash-remove: https://github.com/bitcrackcyber/rurasort/pull/7
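
For anyone wondering what "compile-once" means here: build the regex a single time up front instead of on every input line. A minimal illustration (the pattern is only a stand-in, not rurasort's actual hash-remove pattern):

Code:
import re

# compiled once at import time, then reused for every line
HASH_LIKE = re.compile(r"^[0-9a-f]{32}$")   # stand-in pattern, e.g. a raw MD5

def keep(line):
    # drop lines that look like bare hashes, keep everything else
    return HASH_LIKE.match(line) is None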
#5
(12-22-2019, 09:35 AM)rarecoil Wrote:
(11-06-2019, 01:37 AM)eddie4 Wrote: Ouch, it's only reading at 2-3 MB/s while hitting 100% on a single core. I'll see if I can't improve on this; if I do, I'll post it here.

Thanks for posting this. After reading it, I pushed some similar changes back to rurasort, including some compile-once optimisations for hash-remove: https://github.com/bitcrackcyber/rurasort/pull/7

I'm pretty sure PACK does what you want. I think it's rulegen.py or something (it's been a while since I've done any cracking), and there's an option to make it spit out a wordlist and a rule list that should cover the same keyspace, so to speak.

Code:
$ python rulegen.py korelogic.txt -q
    [*] Using Enchant 'aspell' module. For best results please install
        'aspell' module language dictionaries.
    [*] Analyzing passwords file: korelogic.txt:
    [*] Press Ctrl-C to end execution and generate statistical analysis.
    [*] Saving rules to analysis.rule
    [*] Saving words to analysis.word

See the last lines in my capture: it saves the rules and the words, but you have to play with the settings to get good equivalent coverage. This might take time if you have a big list.
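
To actually use the two output files, the straightforward follow-up is a dictionary attack with the generated words and rules (hash mode and hash file below are placeholders):

Code:
hashcat -m 0 -a 0 hashes.txt analysis.word -r analysis.rule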
#6
Oops, sorry, I forgot the link: https://github.com/iphelix/pack
#7
PACK rulegen works great for this, but it has issues with very large wordlists (Python's memory usage once it hits the sorting phase).

I ran rulegen against rocktastic (13 GB, 1.1 billion entries). That might be along the lines of what you're looking for.

https://github.com/aaronjones111/cauldera