My private dictionary from hundreds of sources
#1
Hello hashcat forum users!

I thought I would share my private dictionary, which came from hundreds of sources. The success rate is very high; I am still getting hits after every hash update. The sources are real passwords.

Hope it will be of use. Size is 2.8 GB with duplicates removed.

1.52 GB ZIP compressed


http://www.mediafire.com/file/5zz9c8wt71...e.zip/file

Enjoy!
#2
First, thanks for thinking to contribute!

I don't want to rain on your parade ... but wordlists without attribution can sometimes be problematic. Everyone goes through this phase. ;)

Code:
$ wc -l dictionary_private.dic
206282806 dictionary_private.dic

There's ... a lot ... of room for curation here, to put it mildly.

There are tens of thousands of uncracked hashes, user:32hex pairs (what look like uncracked user:hash pairs), raw 32hex and 48hex, "[32hex]:'[ipaddress]:[integer],1),", Facebook integer user IDs, found/hash mashups and wrapping errors, encoding errors ...

A large percentage also appears to be a raw scrape from one or more hashing forums of unknown provenance, but poorly split, so the lines are contaminated with HTML and/or JavaScript fragments and are very unlikely to get any real-world hits. More than 7.7M of the lines are 33 characters or longer.

Other stats, in descending order after filtering at each layer:

* 114,731,822 (more than 50%) are already in hashes.org founds
* 8,193,987 are in hashes.org junk founds
* 6.1M are length 6 or less
* 2.1M are exactly 32 characters wide
* 1.4M appear to be HTML- or JavaScript-contaminated
* 1,263,897 have either an infix space or a semicolon
* ~1M are exactly length 30 and appear to either be hashes, base64, or randomly generated
* ~900K are exactly length 22 and appear to either be hashes, base64, or randomly generated
* ~100K appear to be truncated md5crypt
* ~200K appear to be salted 32hex (3-char salt)
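
To make the kind of sifting above a bit more concrete, here is a rough sketch of how lines could be bucketed. It is not the exact process used for the stats above. The regexes, the bucket names, and the assumed "hash:salt" layout for salted 32hex are all illustrative assumptions, and the markup check in particular will flag some legitimate passwords.

```python
import re
from collections import Counter

# Heuristic patterns (assumptions for illustration, not a definitive filter).
HEX32 = re.compile(r"^[0-9a-fA-F]{32}$")
SALTED_HEX32 = re.compile(r"^[0-9a-fA-F]{32}:.{3}$")   # assumes "hash:salt", 3-char salt
USER_HEX32 = re.compile(r"^[^:]+:[0-9a-fA-F]{32}$")    # assumes "user:hash"
MARKUP = re.compile(r"</?[a-zA-Z][^>]*>|&(?:amp|lt|gt|quot|nbsp);|function\s*\(|document\.")


def classify(line: str) -> str:
    """Return a rough bucket name for one wordlist line."""
    if HEX32.match(line):
        return "raw_32hex"
    if SALTED_HEX32.match(line):
        return "salted_32hex"
    if USER_HEX32.match(line):
        return "user_32hex"
    if MARKUP.search(line):
        return "markup"
    if len(line) >= 33:
        return "long"
    if len(line) <= 6:
        return "short"   # short lines can still be real passwords
    return "candidate"


def summarize(lines) -> Counter:
    """Count how many lines fall into each bucket."""
    counts = Counter()
    for raw in lines:
        counts[classify(raw.rstrip("\r\n"))] += 1
    return counts


# Small in-memory sample; a real run would iterate over the file instead, e.g.
# open("dictionary_private.dic", encoding="utf-8", errors="replace")
sample = [
    "5f4dcc3b5aa765d61d8327deb882cf99\n",
    "alice:5f4dcc3b5aa765d61d8327deb882cf99\n",
    "<td>password1</td>\n",
    "correcthorse1\n",
]
print(summarize(sample))
```

Buckets like "long" and "short" are only for counting; whether to drop them is a separate decision.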

Once I realized that a lot of this was harvested from password-cracking forums, it was clear that there are many more hashes in there, but I gave up trying to sift them out.

What do you think could be done to improve the quality of this dictionary?

Also, what can be done to *measure* the quality of this dictionary? For example, if you're getting hits with it, how many of those hits are also in the hashes.org founds?
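
One simple way to put a number on that question is to compare the plaintexts you cracked with this dictionary against an existing founds list and see how many are actually new. The sketch below assumes both lists fit in memory as plain strings; the function name `overlap_stats` and the sample data are hypothetical, and lists of this size would need a streaming or sorted-merge approach instead.

```python
def overlap_stats(hits, known_founds):
    """Compare cracked plaintexts from a run against an existing founds list."""
    hit_set = set(hits)            # de-duplicate the hits
    known_set = set(known_founds)
    already_known = hit_set & known_set
    novel = hit_set - known_set
    total = len(hit_set)
    return {
        "unique_hits": total,
        "already_known": len(already_known),
        "novel": len(novel),
        # Share of hits that the existing founds list would not have given you.
        "novel_ratio": len(novel) / total if total else 0.0,
    }


# Made-up sample data, only to show the call.
sample_hits = ["password1", "letmein", "Tr0ub4dor&3", "letmein"]
sample_known = ["password1", "letmein", "qwerty"]
print(overlap_stats(sample_hits, sample_known))
```

A low novel ratio would suggest the dictionary adds little beyond what the public founds already cover.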
#3
(08-19-2019, 04:40 AM)royce Wrote: First, thanks for thinking to contribute! [...]

First, thanks for your input. The reason most of them are in the hashes.org founds is that I have a user there with 1,700,000 founds, plus 711,206 founds on hashkiller, and I am still getting hits. As for the "junk", it is there just to split out bad hashes or to give an idea about the hash algorithm. The harvested forum was insidepro.com, before it was closed. I couldn't filter everything out, so most of it is still there. The other parts are from real sites, which I don't want to go into further.

I just wanted to share something that might be useful for some. But if you could give a hand filtering the dictionary so that it is unique and contains only real passwords, it would be awesome. I tried, but my knowledge is limited, and I found that the junk wordlist helped me understand my hashes better, as not every 32-hex string has to be MD5. The other part is that the harvested hashes were 32-hex, which I submitted to hashes.org; that is the second reason why most of them are found and on hashes.org.

Thanks a lot