My private dictionary from hundreds of sources
#2
First, thanks for thinking to contribute!

I don't want to rain on your parade ... but wordlists without attribution can sometimes be problematic. Everyone goes through this phase. ;)

Code:
$ wc -l dictionary_private.dic
206282806 dictionary_private.dic

There's ... a lot ... of room for curation here, to put it mildly.

There are tens of thousands of uncracked hashes, user:32hex pairs (which look like uncracked user:hash pairs), raw 32hex and 48hex strings, lines of the form "[32hex]:'[ipaddress]:[integer],1),", Facebook integer user IDs, found/hash mashups, wrapping errors, encoding errors ...

A large percentage also appears to be a raw scrape from one or more hashing forums of unknown provenance, but so poorly split that the lines are contaminated with HTML and/or JavaScript fragments, and so are very unlikely to get any real-world hits. More than 7.7M of the lines are 33 characters or longer.

Other stats, roughly in descending order, with each count taken after filtering out the layers above it:

* 114,731,822 (more than 50%) are already in hashes.org founds
* 8,193,987 are in hashes.org junk founds
* 6.1M are length 6 or less
* 2.1M are exactly 32 characters wide
* 1.4M appear to be contaminated with HTML or JavaScript
* 1,263,897 have either an infix space or a semicolon
* ~1M are exactly length 30 and appear to either be hashes, base64, or randomly generated
* ~900K are exactly length 22 and appear to either be hashes, base64, or randomly generated
* ~100K appear to be truncated md5crypt
* ~200K appear to be salted 32hex (3-char salt)
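
For anyone who wants to reproduce this kind of layered count, here is a minimal sketch in Python. It is not the exact process I used: the rule names, the regexes, and the order of the layers are all assumptions made for illustration, and the markup pattern in particular is a crude guess that will also catch some legitimate passwords. Each line is counted under the first rule it matches, so later layers only see what earlier layers left behind.

```python
import re
from collections import Counter

# All patterns below are illustrative assumptions, not a vetted filter set.
HEX32 = re.compile(r"[0-9a-fA-F]{32}")
USER_HEX32 = re.compile(r"[^:]+:[0-9a-fA-F]{32}")
MARKUP = re.compile(r"</?[a-zA-Z][^>]*>|&[a-zA-Z]+;|function\s*\(")

# Ordered layers: a line is attributed to the first rule that matches it.
RULES = [
    ("raw_32hex", lambda s: HEX32.fullmatch(s) is not None),
    ("user_32hex", lambda s: USER_HEX32.fullmatch(s) is not None),
    ("markup", lambda s: MARKUP.search(s) is not None),
    ("len_33_plus", lambda s: len(s) >= 33),
    ("len_6_or_less", lambda s: len(s) <= 6),
    ("infix_space_or_semicolon", lambda s: " " in s.strip() or ";" in s),
]


def classify(line):
    """Return the name of the first matching layer, or 'kept' if none match."""
    for name, predicate in RULES:
        if predicate(line):
            return name
    return "kept"


def layer_counts(lines):
    """Count how many lines fall into each layer (line endings are stripped)."""
    return Counter(classify(line.rstrip("\r\n")) for line in lines)
```

To run it over a real wordlist, pass an open file handle to layer_counts(), opened with errors="replace" so that encoding errors do not abort the scan. A "kept" line is only a line that none of these particular rules caught; it is not a verdict that the line is a good candidate.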

There are many more hashes in there, but once I realized that a lot of this was harvested from password-cracking forums, I gave up trying to sift them out.

What do you think could be done to improve the quality of this dictionary?

Also, what can be done to *measure* the quality of this dictionary? For example, if you're getting hits with it, how many of those hits are also in the hashes.org founds?
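
One rough way to put a number on that last question is sketched below. It assumes you have your cracked plaintexts ("hits") and a reference list of already-known founds as plain iterables of strings; the function name and the returned keys are hypothetical, chosen only for this example.

```python
def overlap_report(hits, reference_founds):
    """Compare cracked plaintexts against a reference list of known founds.

    Both arguments are iterables of strings. Duplicates are collapsed,
    so all counts are over unique plaintexts.
    """
    hits = set(hits)
    reference = set(reference_founds)
    already_known = hits & reference   # hits the reference list already has
    novel = hits - reference           # hits only this dictionary produced
    return {
        "hits": len(hits),
        "already_known": len(already_known),
        "novel": len(novel),
        # Fraction of unique hits that are new; 0.0 when there are no hits.
        "novel_fraction": len(novel) / len(hits) if hits else 0.0,
    }
```

A low novel_fraction would suggest the dictionary adds little beyond what the reference founds already provide. Loading everything into sets is only practical for modest inputs; for a reference list with over a hundred million lines, comparing sorted files with a tool such as comm is likely a better fit.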


Messages In This Thread
RE: My private dictionary from hundreds of sources - by royce - 08-19-2019, 04:40 AM