First, thanks for taking the time to contribute!
I don't want to rain on your parade ... but wordlists without attribution can sometimes be problematic. Everyone goes through this phase.
Code:
$ wc -l dictionary_private.dic
206282806 dictionary_private.dic
There's ... a lot ... of room for curation here, to put it mildly.
There are tens of thousands of uncracked hashes; user:32hex pairs (which look like uncracked user:hash pairs); raw 32hex and 48hex; lines of the form "[32hex]:'[ipaddress]:[integer],1),"; Facebook integer user IDs; found/hash mashups and wrapping errors; encoding errors ...
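Purely as a rough sketch (the exact patterns in the file may differ), those categories can be approximated with grep; the filename is the one from the wc output above:
Code:
# lines that are nothing but a raw 32-hex digest (unsalted MD5/NTLM-style)
grep -ciE '^[0-9a-f]{32}$' dictionary_private.dic
# lines that are raw 48-hex
grep -ciE '^[0-9a-f]{48}$' dictionary_private.dic
# user:32hex pairs (apparently uncracked user:hash lines)
grep -ciE '^[^:]+:[0-9a-f]{32}$' dictionary_private.dic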
A large percentage also appears to be a raw scrape from one or more hashing forums of unknown provenance, poorly split so that the entries are contaminated with HTML and/or JavaScript fragments - and therefore very unlikely to get any real-world hits. More than 7.7M of the lines are 33 characters or longer.
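A rough way to see the scale of that (sketch only; the HTML/JavaScript patterns below are heuristic guesses, not an exhaustive test):
Code:
# lines 33 characters or longer - almost never useful as password candidates
awk 'length($0) >= 33' dictionary_private.dic | wc -l
# rough count of lines carrying HTML or JavaScript fragments
grep -ciE '</|href=|<div|<span|function\(|document\.' dictionary_private.dic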
Other stats, in descending order, with each count taken after filtering out the previous layer (example filters are sketched after the list):
* 114,731,822 (more than 50%) are already in hashes.org founds
* 8,193,987 are in hashes.org junk founds
* 6.1M are 6 characters or shorter
* 2.1M are exactly 32 characters long
* 1.4M appear to be contaminated with HTML or JavaScript
* 1,263,897 have either an infix space or a semicolon
* ~1M are exactly 30 characters long and appear to be hashes, base64, or randomly generated strings
* ~900K are exactly 22 characters long and appear to be hashes, base64, or randomly generated strings
* ~200K appear to be salted 32hex (3-char salt)
* ~100K appear to be truncated md5crypt
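A minimal sketch of that kind of layered filtering, assuming the hashes.org founds are in a local file called hashes_org_founds.txt (that filename is just a placeholder):
Code:
# layer 1: drop everything already in the hashes.org founds (exact, fixed-string, whole-line matches)
grep -Fvxf hashes_org_founds.txt dictionary_private.dic > layer1.txt
# layer 2: drop very short (6 chars or less) and hash-width (exactly 32 chars) lines
awk 'length($0) > 6 && length($0) != 32' layer1.txt > layer2.txt
# layer 3: drop lines containing a space or a semicolon
grep -vE '[ ;]' layer2.txt > layer3.txt
# layer 4: drop salted 32hex (32hex:3-char salt) and md5crypt-looking lines
grep -viE '^[0-9a-f]{32}:.{3}$' layer3.txt | grep -v '^\$1\$' > layer4.txt
At this scale, grep -f against a 114M-line pattern file gets painful; running LC_ALL=C sort -u on both files and using comm -23 sorted_dictionary sorted_founds is usually the more practical way to do layer 1, but the idea is the same.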
Once I realized that a lot of this was harvested from password-cracking forums, it became clear there are many more hashes in there, but I gave up trying to sift them all out.
What do you think could be done to improve the quality of this dictionary?
Also, what can be done to *measure* the quality of this dictionary? For example, if you're getting hits with it, how many of those hits are also in the hashes.org founds?
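One way to put a number on that (a sketch; my_founds.txt is a placeholder for your own cracked plains from this list, one per line, and hashes_org_founds.txt for a local copy of the hashes.org founds):
Code:
# count how many of your hits also appear in the hashes.org founds
LC_ALL=C sort -u my_founds.txt > mine.sorted
LC_ALL=C sort -u hashes_org_founds.txt > theirs.sorted
comm -12 mine.sorted theirs.sorted | wc -l
The ratio of that overlap to your total number of hits is a rough measure of how much the list adds beyond what is already public: the lower the overlap, the more novel the list.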