Best way to reject tens/hundreds of billions of duplicate candidates?
#4
(01-02-2021, 06:44 PM)x34cha Wrote: So what you need is to find all passwords in the second list that aren't in the first list and extract those.

In a nutshell, yes, this is exactly what I am looking to do, because hashing is much slower than password generation. I have done some Google searches, but the results deal with smaller lists, and I have had difficulty finding anything on handling a problem of this scale. The 40 billion passwords I have already generated (let's call this my base set of uniques) take up ~1.2 TB of space.

I imagine pre-sorting both the master list and the new list before comparing will help, but before writing my own custom algorithm for this extraction process, I'm trying to get input from the community on whether some kind of framework already exists to deal with a problem at this scale.

In the meantime I will start working on this on the side and try the following:
1. Store masterpwlist and newpwlist in two pre-sorted files (alphabetical?)
2. Iterate through newpwlist one by one and do a "<" lexicographic string comparison (all I really know is Python) against masterpwlist. Once that returns False, do an "==" comparison, and if that is also False the word is unique. I'll dump those to a new file called uniquepwlist to run hashcat on. By keeping track of my position in masterpwlist, I shouldn't have to start from its beginning on the next word comparison (rough sketch below). Once hashing is done I can merge uniquepwlist back into masterpwlist.
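For what it's worth, here is a minimal Python sketch of that merge-style comparison. It assumes both files are already sorted and deduplicated with the same collation (e.g. LC_ALL=C sort -u) and contain no blank lines; the file names are just placeholders, not my actual lists:

Code:
def extract_unique(master_path, new_path, out_path):
    with open(master_path, encoding="utf-8", errors="ignore") as master, \
         open(new_path, encoding="utf-8", errors="ignore") as new, \
         open(out_path, "w", encoding="utf-8") as out:
        m = master.readline().rstrip("\n")
        for line in new:
            candidate = line.rstrip("\n")
            # Advance through masterpwlist while its current word sorts before
            # the candidate; since both files are sorted we never need to rewind.
            while m and m < candidate:
                m = master.readline().rstrip("\n")
            # The candidate is unique if masterpwlist is exhausted
            # or the words differ.
            if candidate != m:
                out.write(candidate + "\n")

extract_unique("masterpwlist.txt", "newpwlist.txt", "uniquepwlist.txt")

It streams both files, so memory use stays flat no matter how big the lists get; the cost is one sequential pass over each file.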

Honestly I have no idea if that's the best or most efficient thing to do, but it's something to start with.

Edit: The more I read, the more I see I may have described a merge, and it may be doable from the command line, expanding on what undeath said: sort -u both files and then do a merge-style comparison.
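If coreutils can cope with files this size (GNU sort spills to temporary files on disk, so it doesn't need everything in RAM), the whole thing might look roughly like this. This is only a sketch, not something I've tested at this scale; LC_ALL=C forces plain byte-order sorting, and comm -13 prints only the lines unique to the second file:

Code:
LC_ALL=C sort -u masterpwlist -o masterpwlist.sorted
LC_ALL=C sort -u newpwlist -o newpwlist.sorted
LC_ALL=C comm -13 masterpwlist.sorted newpwlist.sorted > uniquepwlist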