Best way to reject tens/hundreds of billions of duplicate candidates?
I'm trying to crack an old Bitcoin Core wallet I've forgotten the password to. I remember part of it, but there's a lot of uncertainty about what came before or after this part, its position, and so on. As a result I'm generating billions of passwords to guess. It's the one time in my life I kind of went off the wall - but I know I did it in a way that is similar to every other password I've created. So in a sense I'm trying to crack myself.

I have my own custom algorithm and set of rules, and I've generated nearly 40 billion passwords in the past 2 days. I've started running hashcat on these lists, and my current hardware will take ~25 days to get through these passwords (I may order more GPUs if they ever become available).

I'm looking for a way to somehow track this so that when I generate new passwords any duplicates will be discarded prior to hashing.

I have no idea where to go from here and being a total noob just looking into this I've so far thought of two ways:
1. Hashcat brain
2. Some kind of SQL database with unique constrained column for new passwords

I'm not sure if either of these is suitable, nor do I expect them to be fast if they are. But as you can see, my password generation to hashing rate is extremely lopsided in favor of password generation. Thus if I take 2 days to generate another set of 40 billion, and then it takes another day (I'm making this number up as an example) to discard 20 billion duplicates - I will save about 12 days' worth of time and electricity. Even if it took 10 days in this example I'd at least save the majority of electricity costs during that time. And for what it's worth - I'm willing to build a PCIe 4.0 NVMe RAID0 array (8-16TB) to store/query these passwords, which I would regularly back up to a NAS.
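To make the arithmetic in that example explicit, here is a rough sketch in Python. It uses only the figures above (40 billion candidates in ~25 days, 20 billion duplicates, a made-up 1 day dedup pass); the function name net_days_saved is just for illustration.

```python
def net_days_saved(batch_size, hash_days, duplicates, dedup_days):
    """Days of hashing avoided by discarding duplicates, minus the dedup cost."""
    rate_per_day = batch_size / hash_days      # candidates hashed per day
    days_avoided = duplicates / rate_per_day   # hashing time the duplicates would have cost
    return days_avoided - dedup_days

# Figures from the post: 40e9 candidates take ~25 days, 20e9 are duplicates,
# and the dedup pass is assumed (made-up number) to take 1 day.
print(net_days_saved(40e9, 25, 20e9, 1))   # 11.5
print(net_days_saved(40e9, 25, 20e9, 10))  # 2.5
```

So the 1 day case nets roughly 11.5 days, and even a 10 day dedup pass still comes out ahead under these assumptions.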

If there's nothing suitable I could go "all in", generate some enormous list of 500B-1T passwords, and spend the next year hashing and hoping for the best, but I'd prefer not to, as the chances of me remembering everything up front are slim. Alternatively I could get stricter with my rules and greatly reduce the number of candidates, but then the chances I miss something go up.

Any suggestions or help here would be greatly appreciated.
SQL is not well-suited for this endeavour. Generate the lists and run sort -u on them.
(01-02-2021, 12:35 PM)undeath Wrote: SQL is not well-suited for this endeavour. Generate the lists and run sort -u on them.

Wouldn't sort -u still give him the ones he has already tried, though? It just gives the uniques across the two lists. So what you need is to find all passwords in the second list that aren't in the first list and extract those. I'm not good with sort/awk/sed etc., but I'm sure a quick Google search would do you well.
(01-02-2021, 06:44 PM)x34cha Wrote: So what you need is to find all passwords in the second list that aren't in the first list and extract those.

In a nutshell, yes, this is exactly what I am looking to do, because hashing is much slower than password generation. I have done some Google searches, but those dealt with smaller lists, and I have had difficulty finding anything on how to deal with a problem of this scale. The 40 billion passwords I have already generated (let's call this my base set of uniques) take up ~1.2TB of space.

I imagine things like pre-sorting both the master and new list before comparing will help, but before writing my own custom algorithm to do this extraction process, I'm just trying to get input from the community on whether or not some kind of framework already exists to deal with this problem at this scale.

In the meantime I will start working on this on the side and try the following:
1. Store masterpwlist and newpwlist in two pre-sorted files (alphabetical?)
2. Iterate through newpwlist one-by-one and do a "<" lexicographic string comparison (all I really know is Python) against the current word in masterpwlist, advancing through masterpwlist while the master word is smaller. Once that returns False, do an "==" comparison, and if that is False too the new word is unique. I'll dump that to a new file called uniquepwlist to run hashcat on. And by keeping track of my position in masterpwlist, I should no longer have to start from the beginning of it on the next word comparison. Once hashing is done I can merge uniquepwlist into masterpwlist.
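The two steps above could be sketched roughly like this in Python. It is a minimal sketch, not a tuned implementation: the names new_only and write_unique are made up here, and it assumes both files were sorted by raw byte value (for example with LC_ALL=C sort) and that every line, including the last, ends with a newline.

```python
def new_only(master_lines, new_lines):
    """Yield items of new_lines that do not occur in master_lines.

    Both inputs must be sorted ascending under the same ordering.
    """
    master_iter = iter(master_lines)
    current = next(master_iter, None)   # current position in the master list
    previous = None
    for candidate in new_lines:
        if candidate == previous:
            continue                    # duplicate inside the new list itself
        previous = candidate
        # Advance the master list while its word sorts before the candidate.
        while current is not None and current < candidate:
            current = next(master_iter, None)
        # Master is exhausted or its word differs: the candidate is new.
        if current is None or current != candidate:
            yield candidate


def write_unique(master_path, new_path, out_path):
    """Stream both sorted files once and write the new-only lines."""
    # Binary mode compares raw bytes, matching a byte-order (C locale) sort.
    with open(master_path, "rb") as master, open(new_path, "rb") as new, \
            open(out_path, "wb") as out:
        out.writelines(new_only(master, new))
```

Each file is read once from start to finish, so memory use stays flat no matter how large the lists are; the run time should be dominated by disk throughput rather than by the comparisons.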

Honestly I have no idea if that's the best or most efficient thing to do, but it's something to start with.

Edit: The more I read, the more I see I may have described the merge step of a merge sort, which may be doable via the command line by expanding on what undeath said: sort -u both files and then merge them.
what you describe can be achieved by running

sort -m masterpwlist newpwlist | uniq -u

where masterpwlist and newpwlist are sorted
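One thing worth checking before relying on that pipeline: uniq -u keeps every line that occurs exactly once in the merged stream, so as far as I can tell it also lets through passwords that are only in masterpwlist (already tried), not just the ones that are only in newpwlist. The small Python simulation below mimics that behaviour on toy data; the helper name merged_uniq_u and the sample words are made up for illustration.

```python
from heapq import merge
from itertools import groupby


def merged_uniq_u(master, new):
    """Mimic `sort -m master new | uniq -u` on two sorted lists.

    Keeps each line that occurs exactly once in the merged stream.
    """
    return [line for line, group in groupby(merge(master, new))
            if sum(1 for _ in group) == 1]


master = ["apple", "cherry"]   # already hashed
new = ["apple", "banana"]      # freshly generated

print(merged_uniq_u(master, new))  # ['banana', 'cherry']
```

Here "cherry" is only in the master list, yet it shows up in the output. If the goal is strictly the lines that are in newpwlist but not in masterpwlist, the coreutils command comm -13 masterpwlist newpwlist prints exactly those, again assuming both files are sorted under the same locale.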