Wordlist optimisation based on ruleset
Hello all,

Am currently sitting on about 400GByte of unique lines worth of wordlists. I pretty much downloaded all I could find sorted and deduped them. And am finding passwords. Am pretty much only using them with the Dive & best64 rules set. But it most defiantly is not the most effective/fasted way of finding passwords. The main problem I see is that this list is filled with different versions of the same base word.



This is what rule sets are for. In my opinion. But I was wondering if any one has made a reverse ruleset engine? You input a wordlist and ruleset. And it spits out only base words that are required with rules to come to that password. I was thinking of building something like this myself but I only know python and perhaps something like this already exists.
There has been some work in this space, but it's a challenging problem. And it's highly idiosyncratic in that the usefulness of any given wordlist and ruleset depends heavily on the nature of the passwords being attacked.

Generally, getting base words from passwords is sometimes called 'stemming' (a term borrowed from linguists). Some basic stemming can be done with rurapenthe's rurasort (https://github.com/bitcrackcyber/rurasort). IIRC Matt Weir (lakiw) has done work in this space as well.

Also, testing whether a given ruleset works well for a given wordlist and a given hashlist is an art in itself, and again depends heavily on the source and target material. hashcat's --debug-mode / --debug-file and --outfile* parameters are useful for this.

In general, you've got a much larger accumulation of base passwords than is likely to be efficient for most attack types, so your instincts are good. But before trying to get the base words from 400GB of raw wordlists, I'd start with a smaller corpus (like the hashes.org founds) and build up from there. For general password targets, that is more likely to be more time-efficient.
Thank you for you're reply, rurasort looks interesting.

Am still thinking on how i would make this and not make it painfully slow. Might be really complicated and not worth it.

Ouch it's only reading at 2-3MB/s hitting 100% on a single core. Ill see if i can't improve on this if i do ill post it here.


Am really bad with keeping up with forum posts. So am just gonna post the rough piece of code

rurasort.py --digit-trim --special-trim --lower 34.844s 10M lines
code below doing the same thing 2.503s for 10M lines

import multiprocessing as mp,os
path = "/mnt/NVMe/wordlist_10M.txt"
cores = 8
def process(line):
    newstring = line.lstrip('1,2,3,4,5,6,7,8,9,0')
    newstring = newstring.rstrip('1,2,3,4,5,6,7,8,9,0')
    newstring = newstring.lstrip("!\"#$%&'()*+,-./:;?@[\]^_`{|}~")
    newstring = newstring.rstrip("!\"#$%&'()*+,-./:;?@[\]^_`{|}~")
    print newstring.lower()

def process_wrapper(chunkStart, chunkSize):
    with open(path) as f:
        lines = f.read(chunkSize).splitlines()
        for line in lines:

def chunkify(fname,size=1024*1024):
    fileEnd = os.path.getsize(fname)
    with open(fname,'r') as f:
        chunkEnd = f.tell()
        while True:
            chunkStart = chunkEnd
            chunkEnd = f.tell()
            yield chunkStart, chunkEnd - chunkStart
            if chunkEnd > fileEnd:

#init objects
pool = mp.Pool(cores)
jobs = []

#create jobs
for chunkStart,chunkSize in chunkify(path):
    jobs.append( pool.apply_async(process_wrapper,(chunkStart,chunkSize)) )

#wait for all jobs to finish
for job in jobs:

#clean up