Wordlist optimisation based on ruleset
#1
Hello all,

I'm currently sitting on about 400 GB of unique lines worth of wordlists. I pretty much downloaded everything I could find, then sorted and deduplicated it, and I am finding passwords. I'm mostly running the lists with the dive and best64 rulesets, but it's definitely not the most effective or fastest way of finding passwords. The main problem I see is that the list is filled with different versions of the same base word.

Example:

Code:
welkom
welkom01
welkom01!
Welkom

This is what rulesets are for, in my opinion. But I was wondering if anyone has made a reverse ruleset engine: you input a wordlist and a ruleset, and it spits out only the base words that are required, together with the rules, to arrive at each password. I was thinking of building something like this myself, but I only know Python, and perhaps something like this already exists.
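
To make the idea concrete, here is a very rough Python sketch of what I have in mind. It is not a real reverse ruleset engine: it only knows how to undo a couple of trivial rules (appended digits and '!', and capitalisation), and it doesn't actually read a ruleset file, so treat it purely as an illustration of the reduction step.

Code:
# Rough sketch of the "reverse rule" idea, NOT a full engine: it only
# understands a couple of hashcat-style rules (append digit '$N',
# append '$!', capitalise 'c') and inverts them by stripping/lowercasing.
import sys

def base_candidates(word):
    """Yield progressively more 'stripped' candidates for a password."""
    yield word
    stripped = word.rstrip('0123456789!')   # inverse of $0..$9 and $!
    if stripped != word:
        yield stripped
    if stripped and stripped[0].isupper():  # inverse of the 'c' rule
        yield stripped[0].lower() + stripped[1:]

def reduce_wordlist(lines):
    """Keep a word only if no already-kept base can generate it."""
    bases = set()
    for word in lines:
        word = word.rstrip('\r\n')
        if not word:
            continue
        candidates = list(base_candidates(word))
        # if any stripped form is already a kept base, this word is covered
        if not any(c in bases for c in candidates[1:]):
            bases.add(candidates[-1])
    return bases

if __name__ == '__main__':
    for base in sorted(reduce_wordlist(sys.stdin)):
        print(base)

Fed the four "welkom" lines from the example above, it collapses them down to just "welkom". A real tool would obviously have to take the actual ruleset as input and verify coverage, which is the hard part.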
Reply
#2
There has been some work in this space, but it's a challenging problem. And it's highly idiosyncratic in that the usefulness of any given wordlist and ruleset depends heavily on the nature of the passwords being attacked.

Generally, getting base words from passwords is sometimes called 'stemming' (a term borrowed from linguists). Some basic stemming can be done with rurapenthe's rurasort (https://github.com/bitcrackcyber/rurasort). IIRC Matt Weir (lakiw) has done work in this space as well.

Also, testing whether a given ruleset works well for a given wordlist and a given hashlist is an art in itself, and again depends heavily on the source and target material. hashcat's --debug-mode / --debug-file and --outfile* parameters are useful for this.
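
For example, if you run with --debug-mode and --debug-file (check hashcat --help for the exact mode numbers; as I recall, mode 3 logs original_word:matched_rule for every crack), a few lines of Python will tell you which base words and rules are actually earning their keep:

Code:
# Quick-and-dirty tally of a hashcat --debug-file. Assumes the file was
# produced with --debug-mode=3, i.e. "original_word:matched_rule" per
# cracked hash -- verify the mode number against `hashcat --help`.
import collections
import sys

word_hits = collections.Counter()
rule_hits = collections.Counter()

with open(sys.argv[1], 'r', errors='replace') as debug_file:
    for line in debug_file:
        line = line.rstrip('\n')
        if not line:
            continue
        # split from the right so a ':' inside the base word is less
        # likely to break the word/rule split (still not bulletproof)
        if ':' in line:
            word, _, rule = line.rpartition(':')
        else:
            word, rule = line, ''
        word_hits[word] += 1
        rule_hits[rule] += 1

print('top base words:')
for word, count in word_hits.most_common(20):
    print('%8d  %s' % (count, word))

print('\ntop rules:')
for rule, count in rule_hits.most_common(20):
    print('%8d  %s' % (count, rule))

Point it at the debug file after a cracking session and it prints the 20 most productive base words and rules.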

In general, you've got a much larger accumulation of base passwords than is likely to be efficient for most attack types, so your instincts are good. But before trying to extract base words from 400 GB of raw wordlists, I'd start with a smaller corpus (like the hashes.org founds) and build up from there. For general password targets, that will likely be more time-efficient.
Reply
#3
Thank you for your reply; rurasort looks interesting.


I'm still thinking about how I would build this without making it painfully slow. It might be really complicated and not worth it.

EDIT:
Ouch, it's only reading at 2-3 MB/s while hitting 100% on a single core. I'll see if I can improve on this; if I do, I'll post it here.

EDIT2:

I'm really bad at keeping up with forum posts, so I'm just going to post the rough piece of code.

rurasort.py --digit-trim --special-trim --lower: 34.844 s for 10M lines
The code below, doing the same thing: 2.503 s for 10M lines

Code:
import multiprocessing as mp
import os

path = "/mnt/NVMe/wordlist_10M.txt"
cores = 8

def process(line):
    # trim leading/trailing digits, then leading/trailing specials, then
    # lowercase -- same idea as rurasort --digit-trim --special-trim --lower
    newstring = line.strip('0123456789')
    newstring = newstring.strip("!\"#$%&'()*+,-./:;?@[\\]^_`{|}~")
    # note: output from different workers arrives in arbitrary order
    print(newstring.lower())


def process_wrapper(chunkStart, chunkSize):
    # each worker re-opens the file and handles its own byte range
    with open(path, 'rb') as f:
        f.seek(chunkStart)
        lines = f.read(chunkSize).decode('utf-8', errors='replace').splitlines()
        for line in lines:
            process(line)

def chunkify(fname, size=1024 * 1024):
    # yield (start, length) byte ranges aligned to line boundaries
    fileEnd = os.path.getsize(fname)
    with open(fname, 'rb') as f:
        chunkEnd = f.tell()
        while True:
            chunkStart = chunkEnd
            f.seek(size, 1)
            f.readline()          # skip forward to the end of the current line
            chunkEnd = f.tell()
            yield chunkStart, chunkEnd - chunkStart
            if chunkEnd > fileEnd:
                break

#init objects
pool = mp.Pool(cores)
jobs = []

#create jobs
for chunkStart, chunkSize in chunkify(path):
    jobs.append(pool.apply_async(process_wrapper, (chunkStart, chunkSize)))

#wait for all jobs to finish
for job in jobs:
    job.get()

#clean up
pool.close()
pool.join()
Reply
#4
(11-06-2019, 01:37 AM)eddie4 Wrote: Ouch, it's only reading at 2-3 MB/s while hitting 100% on a single core. I'll see if I can improve on this; if I do, I'll post it here.

Thanks for posting this. After reading it, I pushed some similar changes back to rurasort, including some compile-once optimisations for hash-remove: https://github.com/bitcrackcyber/rurasort/pull/7
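
In case it's useful to anyone, the gist of "compile once" is just to build the regex a single time instead of on every line. Something along these lines (the hash pattern here is only an example, not what rurasort actually uses):

Code:
# Illustration of the compile-once idea, not the actual rurasort PR:
# the pattern is built a single time, outside the per-line loop.
import re
import sys

# hypothetical example: drop lines that look like 32-char hex (MD5-ish) hashes
HASH_RE = re.compile(r'^[0-9a-fA-F]{32}$')

def filter_hashes(lines):
    for line in lines:
        if not HASH_RE.match(line.rstrip('\n')):
            yield line

if __name__ == '__main__':
    sys.stdout.writelines(filter_hashes(sys.stdin))

Most of the win comes from not rebuilding or re-looking-up the pattern for every single line of a multi-gigabyte list.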
Reply
#5
(12-22-2019, 09:35 AM)rarecoil Wrote:
(11-06-2019, 01:37 AM)eddie4 Wrote: Ouch, it's only reading at 2-3 MB/s while hitting 100% on a single core. I'll see if I can improve on this; if I do, I'll post it here.

Thanks for posting this. After reading it, I pushed some similar changes back to rurasort, including some compile-once optimisations for hash-remove: https://github.com/bitcrackcyber/rurasort/pull/7

I'm pretty sure PACK does what you want. I think it's rulegen.py or something (it's been a while since I've done any cracking), and there's an option to make it spit out a word list and a rule list that should cover the same keyspace, so to speak.

Code:
$ python rulegen.py korelogic.txt -q
    [*] Using Enchant 'aspell' module. For best results please install
        'aspell' module language dictionaries.
    [*] Analyzing passwords file: korelogic.txt:
    [*] Press Ctrl-C to end execution and generate statistical analysis.
    [*] Saving rules to analysis.rule
    [*] Saving words to analysis.word

See the last lines in my capture: it saves the rules and the words, but you have to play with the settings to get good equivalent coverage. This might take a while if you have a big list.
Reply
#6
Oops, sorry, forgot the link: https://github.com/iphelix/pack
Reply
#7
PACK rulegen works great for this, but it's got issues with very large wordlists (Python's memory usage once it hits the sorting phase).

I did rulegen against rocktastic (13G, 1.1 billion entries). That might be along the lines of what you're looking for.

https://github.com/aaronjones111/cauldera
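
One way around the memory blow-up, if you want to try this on your own 400 GB, is to split the wordlist into slices, run rulegen on each slice, and then merge and dedupe the resulting .word and .rule files. You lose a little accuracy versus one pass over everything, and coreutils `split -l` does the same job, but here's a quick Python version since that's what you're working in:

Code:
# Pure illustration, not part of PACK: slice a huge wordlist into
# fixed-size pieces so rulegen can be run on each piece separately.
import sys

def split_wordlist(path, lines_per_slice=50000000):
    out = None
    slice_no = 0
    with open(path, 'rb') as src:
        for i, line in enumerate(src):
            if i % lines_per_slice == 0:
                if out:
                    out.close()
                # output files are named <wordlist>.part0000, .part0001, ...
                out = open('%s.part%04d' % (path, slice_no), 'wb')
                slice_no += 1
            out.write(line)
    if out:
        out.close()

if __name__ == '__main__':
    # usage: python split_wordlist.py wordlist.txt [lines_per_slice]
    lines = int(sys.argv[2]) if len(sys.argv) > 2 else 50000000
    split_wordlist(sys.argv[1], lines)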
Reply
#8
Hey all, I don't know if this is faster or not, but for the issue of really large file sizes and speed, this works really well for me.

For example, something like this trims out anything but letters and then uniques the values (note that `uniq` only collapses adjacent duplicates, so sort the list first if it isn't already sorted):

Code:
cat WORDLIST | sed 's#[^a-zA-Z]##g' | uniq > OUTPUT

WORDLIST

Quote:Test010!
test2
$$3tests


OUTPUT

Quote:Test
test
tests



Albeit not very user-friendly to figure out, it does open up other possibilities: if your wordlist is compressed with gzip or xz, you can use `zcat` or `xzcat` instead of `cat`, and then pipe the result back into a compressor at the end. I hope this helps someone. You can also lowercase everything first, for example by piping through `tr 'A-Z' 'a-z'` before `uniq` (the `-i` flag on `uniq` only ignores case when comparing and keeps the first spelling it sees), and then you'd be left with:


OUTPUTCASE
Quote:test
tests



NOTE: the above works natively on Linux, and I think on macOS too. On Windows you can use the same commands if you install Git for Windows and make sure you choose to install Git Bash in the options.
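
And since most of this thread is Python anyway, here is a rough Python equivalent of the pipeline above for anyone who'd rather not shell out. It reads .gz files directly (like zcat would) and dedupes with a set, which means every unique base word has to fit in memory, unlike a streaming `sort | uniq`:

Code:
# Rough Python equivalent of the shell pipeline above: keep only letters,
# lowercase, and deduplicate. Reads .gz directly; plain files work too.
import gzip
import re
import sys

NON_LETTERS = re.compile(rb'[^a-zA-Z]')

def bases(path):
    opener = gzip.open if path.endswith('.gz') else open
    seen = set()
    with opener(path, 'rb') as src:
        for line in src:
            word = NON_LETTERS.sub(b'', line).lower()
            if word and word not in seen:
                seen.add(word)
                yield word

if __name__ == '__main__':
    out = sys.stdout.buffer
    for word in bases(sys.argv[1]):
        out.write(word + b'\n')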
Reply