With all these new leaked lists, there is the issue of crap/generated passwords "contaminating" the word lists.
What methods do people here use to attempt to remove them?
Can you give an example of a "contaminated" password?
The most recent article (2016) about this is "So, Just Why Is 18atcskd2w Such a Popular Password?" at:
http://www.tripwire.com/state-of-securit...-password/
One of the many articles discussing the Stratfor list, "Challenges with Evaluating Password Cracking Algorithms" at:
http://reusablesec.blogspot.com/2015/08/...sword.html
has the sentence: "A majority of the passwords in the Stratfor dataset were machine generated."
One method of detection is described in "A list of flaws in the data set", where Mark Burnett writes about the "Ten Million Passwords" list he released:
"I have an algorithm in my Hurl script that looks for situations where both the username and password have abnormally high entropy and therefore likely both were computer-generated. The algorithm looks at many weighted criteria (such as both being exactly 8 characters long or containing only hex characters) and comes up with a score. I had the weight a littler lower than it should be to avoid false positives but that means there are still many passwords that were obviously not selected by humans."
I asked for an example, not a book list.
E.g., in the Stratfor list:
gyq3eftf
gyq3hmpr
gyq3vrwf
gyq9natb
gyq9z9cv
gyqbv3bl
gyqctog6
gyqggjkb
gyqgzubc
gyqh2eww
gyqjhue7
gyqjuaf7
gyqkc9b6
gyqkcern
gyqndsww
gyqpfmek
gyqpurue
gyqrect9
gyqruaud
gyqrxqp6
gyqu9tyq
gyqzvnvu
gyr5td9h
gyr5ywcp
gyr7dta6
gyr9kbsb
gyrbhszg
gyrc3fok
gyrc8ar7
gyrfdekt
gyrh4gaw
gyrh6ab3
gyrjohup
gyrkei7p
gyrkmebl
gyrokgw9
gyrprrya
gyrqkvuj
gyrrf7nz
gyrt4ar3
gyrtnepj
gyrunw6p
gyrv92bp
gyrx6qqr
gyrxdtvj
gyrxdu9d
gys2rfxr
Those look machine-generated to me, not something a human would do. The majority of the 8 character words are like that.
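For what it's worth, the kind of weighted scoring Burnett describes in the quote earlier in the thread could be sketched roughly like this. This is not his Hurl script: the function names, the three criteria, the equal weights, and the 2.5-bit entropy threshold are all assumptions made up for illustration. Per-string character entropy is also a weak signal on its own (an ordinary word with few repeated letters scores high too), which is why it only counts here when the username and the password both look random.

```python
import math
from collections import Counter

HEX_CHARS = set("0123456789abcdef")

def shannon_entropy(s):
    """Bits per character, from the character frequencies inside s."""
    if not s:
        return 0.0
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in Counter(s).values())

def machine_score(username, password):
    """Hypothetical weighted score; higher means more likely generated."""
    score = 0.0
    # Criterion 1 (assumed weight 1.0): both fields exactly 8 characters.
    if len(username) == 8 and len(password) == 8:
        score += 1.0
    # Criterion 2 (assumed weight 1.0): password is 8+ hex characters only.
    if len(password) >= 8 and set(password.lower()) <= HEX_CHARS:
        score += 1.0
    # Criterion 3 (assumed weight 1.0): both fields have high entropy.
    if shannon_entropy(username) >= 2.5 and shannon_entropy(password) >= 2.5:
        score += 1.0
    return score
```

A pair would be flagged once the score passes some cut-off; where to put that cut-off is the same false-positive trade-off Burnett mentions.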
So you're looking for a way to distinguish human-generated passwords from machine-generated passwords? I know of no public tool that can do this, but it certainly can be done with some degree of accuracy using Markov chains, machine learning, etc.
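A minimal sketch of the Markov chain idea, assuming you have a list of passwords you trust to be human-chosen to train on: build a character bigram model, then score each candidate by its average log-probability per transition. Strings the model finds very unlikely are candidates for "machine-generated". The function names, the add-one smoothing, and the alphabet size of 96 are assumptions for illustration, not taken from any existing tool.

```python
import math
from collections import defaultdict

def train_bigrams(words):
    """Count character-to-character transitions, with ^ and $ as start/end markers."""
    counts = defaultdict(lambda: defaultdict(int))
    for w in words:
        padded = "^" + w.lower() + "$"
        for a, b in zip(padded, padded[1:]):
            counts[a][b] += 1
    return counts

def avg_log_prob(word, counts, alphabet_size=96):
    """Average natural-log probability per transition, with add-one smoothing."""
    padded = "^" + word.lower() + "$"
    total = 0.0
    for a, b in zip(padded, padded[1:]):
        row = counts.get(a, {})
        row_total = sum(row.values())
        # Add-one smoothing so unseen transitions get a small nonzero probability.
        p = (row.get(b, 0) + 1) / (row_total + alphabet_size)
        total += math.log(p)
    return total / (len(padded) - 1)
```

Usage would be something like `model = train_bigrams(human_words)` followed by `avg_log_prob("gyq9z9cv", model)`, comparing the result against a threshold picked by looking at how known-human words score. A real training list needs to be far larger than a handful of words for the scores to mean much.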
Right. Because those machine-generated passwords clutter up and make word lists less efficient. And when word lists get combined, the clutter/crap/noise increases geometrically.
That well known computer saying "Garbage In = Garbage Out" applied to combining bad word lists becomes "Garbage * Garbage = Garbage Squared."
Note that you don't want to drop them completely, because such random-looking passwords are mostly the golden passwords. If a person actually uses one, there's an additional chance that the password is reused, especially for more important hashes. A cracked password shouldn't be removed from a wordlist, even if it looks random.
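One way to act on that is to demote suspects instead of deleting them. A rough sketch: `looks_random` below is a deliberately crude, hypothetical stand-in for whatever detector you actually use (8 characters, lowercase letters and digits only, at least one letter, under 25% vowels), and `demote_suspects` just moves flagged entries to the end of the list while keeping every word.

```python
VOWELS = set("aeiou")
ALLOWED = set("abcdefghijklmnopqrstuvwxyz0123456789")

def looks_random(word):
    """Crude placeholder detector; the thresholds are assumptions, not tuned."""
    if len(word) != 8 or not set(word) <= ALLOWED:
        return False
    if not any(ch.isalpha() for ch in word):
        return False  # all-digit strings such as 12345678 are left alone
    vowels = sum(1 for ch in word if ch in VOWELS)
    return vowels / len(word) < 0.25

def demote_suspects(words, is_suspect=looks_random):
    """Return all words, with suspects moved to the end; nothing is dropped."""
    # sorted() is stable, so the original order is kept within each group.
    return sorted(words, key=is_suspect)
```

With this approach a false positive only costs the word its position in the list, not its place in it.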
Over on the Hashes.org Forum, under General, there are discussions such as "fake, corrupt and other crap hashes" (
https://hashes.org/forum/viewtopic.php?f=3&t=1709).
And they have "junk lists" on
https://hashes.org/crackers.php.
But the examples above are not in there.
Putting the ones I listed above into analysis tools such as PACK, or using them as "training lists" for other tools, is a waste of time and leads to erroneous/useless results. The research crowd seems to agree that the 8-character Stratfor words are mostly machine-generated.
(If I remember correctly, atom's combined password in one of those articles about cracking was a combination of human created passwords, something to do with "mom of 8 great kids" or similar.)
--------------------------------------------------------------------------
One more item for my "book" above, even Team Hashcat's unix-ninja, in his "Password DNA" article at
https://www.unix-ninja.com/p/Password_DNA, mentioned the need to sanitize:
"finally, entries which are known to belong to bots will be removed (these entries do not accurately reflect password authors' behaviours and only skew the results of a dictionary in unfavourable ways)"