Wordlist optimisation based on ruleset - Printable Version

+- hashcat Forum (https://hashcat.net/forum)
+-- Forum: Misc (https://hashcat.net/forum/forum-15.html)
+--- Forum: User Contributions (https://hashcat.net/forum/forum-25.html)
+--- Thread: Wordlist optimisation based on ruleset (/thread-8739.html)
Wordlist optimisation based on ruleset - eddie4 - 10-23-2019

Hello all,

I'm currently sitting on about 400 GB of unique lines' worth of wordlists. I pretty much downloaded everything I could find, then sorted and deduped it all, and I am finding passwords. I'm mostly using the lists with the dive and best64 rulesets, but that is definitely not the most effective/fastest way of finding passwords.

The main problem I see is that the list is filled with different versions of the same base word. Example:

Code:
welkom

This is what rulesets are for, in my opinion. But I was wondering if anyone has made a reverse ruleset engine: you input a wordlist and a ruleset, and it spits out only the base words that are required, plus the rules needed to reach each password. I was thinking of building something like this myself, but I only know Python, and perhaps something like this already exists.


RE: Wordlist optimisation based on ruleset - royce - 10-29-2019

There has been some work in this space, but it's a challenging problem. And it's highly idiosyncratic, in that the usefulness of any given wordlist and ruleset depends heavily on the nature of the passwords being attacked.

Generally, getting base words from passwords is sometimes called 'stemming' (a term borrowed from linguists). Some basic stemming can be done with rurapenthe's rurasort (https://github.com/bitcrackcyber/rurasort). IIRC Matt Weir (lakiw) has done work in this space as well.

Also, testing whether a given ruleset works well for a given wordlist and a given hashlist is an art in itself, and again depends heavily on the source and target material. hashcat's --debug-mode / --debug-file and --outfile* parameters are useful for this.

In general, you've got a much larger accumulation of base passwords than is likely to be efficient for most attack types, so your instincts are good. But before trying to get the base words from 400 GB of raw wordlists, I'd start with a smaller corpus (like the hashes.org founds) and build up from there. For general password targets, that is likely to be more time-efficient.


RE: Wordlist optimisation based on ruleset - eddie4 - 11-06-2019

Thank you for your reply; rurasort looks interesting. I'm still thinking about how I would build this without making it painfully slow. It might be really complicated and not worth it.

EDIT: Ouch, it's only reading at 2-3 MB/s and hitting 100% on a single core. I'll see if I can't improve on this; if I do, I'll post it here.

EDIT2: I'm really bad at keeping up with forum posts, so I'm just going to post the rough piece of code.

rurasort.py --digit-trim --special-trim --lower: 34.844s for 10M lines
The code below, doing the same thing: 2.503s for 10M lines

Code:
import multiprocessing as mp,os
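The rest of eddie4's listing is not reproduced above, so the following is a rough sketch only, not the original script: one way a multiprocessing digit-trim / special-trim / lower pass could be written in Python. The batch size is arbitrary, and the trimming behaviour here may not match rurasort's in every edge case.

Code:
# Sketch only: parallel digit/special trim + lowercase over a wordlist.
# Usage: python trim_parallel.py wordlist.txt > cleaned.txt
import multiprocessing as mp
import string
import sys

# Characters trimmed from both ends of each candidate (digits plus ASCII specials).
STRIP_CHARS = string.digits + string.punctuation

def clean_chunk(lines):
    # Trim leading/trailing digits and specials, lowercase, and drop empty results.
    cleaned = (line.strip().strip(STRIP_CHARS).lower() for line in lines)
    return [word for word in cleaned if word]

def chunks(fh, size=1_000_000):
    # Yield large batches of lines so each worker gets a big unit of work.
    buf = []
    for line in fh:
        buf.append(line)
        if len(buf) >= size:
            yield buf
            buf = []
    if buf:
        yield buf

if __name__ == "__main__":
    with open(sys.argv[1], encoding="utf-8", errors="ignore") as fh, mp.Pool() as pool:
        for batch in pool.imap(clean_chunk, chunks(fh)):
            if batch:
                sys.stdout.write("\n".join(batch) + "\n")

Batching lines rather than dispatching one line per task keeps the inter-process overhead low, which is usually where naive multiprocessing rewrites lose their speed advantage.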
RE: Wordlist optimisation based on ruleset - rarecoil - 12-22-2019

(11-06-2019, 01:37 AM)eddie4 Wrote: Ouch, it's only reading at 2-3 MB/s and hitting 100% on a single core. I'll see if I can't improve on this; if I do, I'll post it here.

Thanks for posting this. After reading it, I pushed some similar changes back to rurasort, including some compile-once optimisations for hash-remove: https://github.com/bitcrackcyber/rurasort/pull/7


RE: Wordlist optimisation based on ruleset - aprizm - 01-23-2020

(12-22-2019, 09:35 AM)rarecoil Wrote:
(11-06-2019, 01:37 AM)eddie4 Wrote: Ouch, it's only reading at 2-3 MB/s and hitting 100% on a single core. I'll see if I can't improve on this; if I do, I'll post it here.

I'm pretty sure PACK does what you want. I think it's done with rulegen.py or something (it's been a while since I've done any cracking), and there's an option to have it spit out a wordlist and a rule list that should cover the same keyspace, so to speak.

Code:
$ python rulegen.py korelogic.txt -q

See the last lines in my capture: it saves the rules and the words, but you have to play with the settings to get good equivalent coverage. This might take time if you have a big list.


RE: Wordlist optimisation based on ruleset - aprizm - 01-23-2020

Oops, sorry, I forgot the link: https://github.com/iphelix/pack


RE: Wordlist optimisation based on ruleset - mahoganyduck - 01-23-2020

PACK rulegen works great for this, but it's got issues with very large wordlists (Python's memory usage once it hits the sorting phase). I did rulegen against rocktastic (13 GB, 1.1 billion entries); that might be along the lines of what you're looking for: https://github.com/aaronjones111/cauldera


RE: Wordlist optimisation based on ruleset - pdoctor - 08-12-2020

Hey all, I don't know if this is faster or not, but for the issue of really large file sizes and speed, this works really well for me. For example, something like this would trim out anything but letters and then unique the values:

Code:
cat WORDLIST | sed 's#[^a-zA-Z]##g' | uniq > OUTPUT

WORDLIST
Quote:
Test010!
test2
$$3tests

OUTPUT
Quote:
Test
test
tests

While it's not very user-friendly to figure out how to use, it does open up other possibilities: if your wordlist is compressed with gzip or xz or something, you can use `zcat` or `xzcat` instead of `cat`, and then pipe it back into a compressor at the end (see the sketch below). Well, I hope this helps someone.

You can also add the `-i` flag to the uniq command to compare lines case-insensitively; uniq keeps the first occurrence of each run, so you'd be left with:

OUTPUTCASE
Quote:
Test
tests

NOTE: the above is on Linux natively, and I think on macOS also. On Windows you can use the same commands too if you install Git for Windows and make sure you choose to install Git Bash in the options.
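As a quick illustration of the compressed-wordlist variant pdoctor describes (the filenames are placeholders, not from the thread), the same letter-only trim can read and write compressed files directly:

Code:
zcat wordlist.txt.gz | sed 's#[^a-zA-Z]##g' | uniq | gzip > stemmed.txt.gz
xzcat wordlist.txt.xz | sed 's#[^a-zA-Z]##g' | uniq | xz > stemmed.txt.xz

Keep in mind that uniq only collapses adjacent duplicate lines; a complete dedupe needs sort -u instead, at a correspondingly higher cost on very large lists.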