Hashcat Utilities List Cleaner
#1
I was wondering if I could interest you in producing more hashcat utilities? I understand this takes time away from your work on the hashcat suite, but I personally believe it is of almost equal importance when auditing passwords.

There is a saying about computers: garbage in, garbage out. It applies equally to hashcatplus and WPA key finding.

In our case I am of course thinking about word lists, and the rather low quality ones which seem to get passed about. Hashcatplus is only as good as the words provided, so I hope you will turn your unrivalled skills in this direction.

If you ever dare to peer inside one of these multi-gigabyte word lists you will quickly see for yourself just how low quality some of them are. Many lines contain text outside the printable ASCII range, which is phenomenally unlikely to appear in anyone's password. Add to that duplicate entries, multiple entries distinguished only by an appended number, random numbers, e-mail addresses and toggled-case variants, so that even the smallest word sample runs to many GB, and of course poorly formatted text with tabs and spaces at either end.

Now that you have kindly provided us with rules in hashcatplus we don't need these huge lists, so they can be dramatically reduced in size and will probably cover more keyspace as well.

What hashcatplus users need is a collection of base words, all lowercase, without prepended or appended numbers or poor formatting (tabs and spaces, HTML code, e-mail addresses and so on), and then let the rules do the work.
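
To make the idea concrete, here is a rough sketch of the kind of normalisation I mean, using standard GNU text tools (the file names are only examples, and this is no substitute for a proper utility):

```
# lowercase everything, strip leading/trailing tabs and spaces, then dedupe
LC_ALL=C tr 'A-Z' 'a-z' < raw_list.txt \
  | sed 's/^[ \t]*//; s/[ \t]*$//' \
  | LC_ALL=C sort -u > base_words.txt
```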

There are a few tools and scripts available which are either hard to use or don't work properly, and sometimes both! Many times I have used these tools only to discover I have lost many good words from a list. The biggest frustration, however, is the lack of options and the inability to work with very large lists when the user's computer has a modest amount of RAM.

Blazer's tool ULM is the best there is for ease of use and options, but it suffers from a few bugs which prevent its use in some situations. To compound the problem, Blazer has stated that he intends to retire ULM and cease development, which is a great shame as he was doing so well with it. I think this is a significant loss to the community.

Would you please consider turning your talents towards helping hashcat users clean up these lists? I appreciate that a simple text file cleaner may at first seem a waste of your abilities, but without one I think hashcat is hindered by poor lists.

If you are interested in this, perhaps you might like to take a look at ULM and all of its features. I am not asking, nor do I expect, you to incorporate all of its features into hashcat utilities, but some good "purifying" tools would be very helpful. Removing the entire line if it contains any non-ASCII characters would be a great start!
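
For example, something along these lines (GNU grep; untested on huge files, so treat it as a sketch) drops every line containing a byte outside the printable ASCII range:

```
# -v drops any line containing a byte outside space (0x20) .. tilde (0x7E);
# note this also drops lines containing tabs or other control characters.
# -a forces grep to treat the file as text even if it holds binary bytes.
LC_ALL=C grep -av '[^ -~]' dirty_list.txt > ascii_only.txt
```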

I will try not to write too much about this now, as it is a subject I can really get into, but I have many ideas and will wait to see whether the subject interests you before going further. I do hope this inspires you, as your software has a feel of quality about it that I just don't get from the other scripts and tools available at the moment.

Thank you.
#2
When I started with hash cracking I thought exactly the same way. I thought: with a rule engine it is no longer necessary to have all these mutated words in my dictionary. All I need is a clean dictionary, and then I just add more rules. This will produce exactly the same candidates and make my attacks much more efficient.

Today I think differently. Sure, you can optimise out a lot of words, leading to quicker runs because there are fewer words, but you also lose a lot of potential password candidates. It is really hard to explain this experience, so I will try with a single example and hope you get the whole picture from it.

- From your point of view, this is a "bad" word to have in a dictionary because it is an e-mail address: honey@gmail.com
- Here is a rule from rules/generated.rule: '5o0m

Now run this in hashcat with --stdout, and don't tell me you expected that result.
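
If you want to reproduce it (the syntax below is from a current hashcat build; the file names are just examples): the rule function '5 truncates the word to five characters, and o0m overwrites position 0 with the letter m.

```
echo 'honey@gmail.com' > dict.txt
printf "%s\n" "'5o0m" > test.rule
hashcat --stdout -r test.rule dict.txt
# prints: money   ('5 gives "honey", then o0m gives "money")
```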

If you really want a cleaner dictionary, I can recommend this one: wikipedia-wordlist-sraveau-20090325.txt
Just google it, you will find it. It is a dump of the Wikimedia files, and you can be sure there is nearly no garbage in there.
#3
I do feel that free command-line utilities almost always work better and faster than their commercial counterparts. I find that I can do anything I need in terms of wordlist manipulation (simple to complex) using existing tools such as the free hashcat utilities, along with other free and commercial tools.

The only thing *I* would wish for, if I had the chance (in the hashcat utilities suite), is the addition of a utility that could sort MULTI-GIGABYTE wordlists with the same awesome speed and efficiency the existing tools bring to their tasks.

There really IS no Windows tool that can do that yet, anywhere; not slowly, and especially not quickly. Don't be fooled by commercial tools that claim to do this, as the output from those tools is usually a ruined version of the input file.

So that's what I'd like to see: *proper* sorting and duplicate removal for multi-gigabyte wordlists, which often exceed 20 gigabytes. Of course, this would also mean the ability to parse those very same wordlists and remove dangerous characters that other programs could "trip" on while processing the files, such as NUL characters.

Maybe the volatile characters to be removed could be selected on the command line by the user as an extra bonus? :p
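
In the meantime, the closest thing I know of is GNU sort, which already does an external merge sort with on-disk temporary files. Something like this (flags from GNU coreutils; the buffer size and paths are just examples) should cope with files far larger than RAM:

```
# strip NUL bytes, then sort and dedupe using a bounded 512 MB memory
# buffer, spilling intermediate runs into the given temp directory
tr -d '\000' < huge_list.txt \
  | LC_ALL=C sort -u -S 512M -T /path/to/scratch -o huge_list.sorted.txt
```
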
#4
(11-20-2011, 04:39 PM)atom Wrote: When I started with hash cracking I thought exactly the same way.

Ahh... well... great minds and all that!

Thanks for the tip about that huge dictionary, I will be playing with that one for some time!

I sort of understand what you are teaching me, but I forgot to mention that I am almost always talking about WPA when I am here, as that is my main interest.

Your technique is very good for user passwords on domains, websites and, I suppose, everyday logins. This is where the user chooses something simple because they have to type it in every day.

This is in no way a criticism, nor am I trying to belittle what you do, but those are, to be fair, easy targets. I believe, and I have some evidence to back this up, that when people choose a WPA password they are in a different mindset altogether. They seem to understand that they only have to type the WPA password once or twice and can then forget it, so they do make an effort to pick a good one! This is perhaps why I seem difficult on here: I am chasing a different target to you.

In your example I am sure the result would be in the list anyway, which is another reason I am trying to make unique password lists. The rules in this case would probably have worked against you, as you would have duplicated a common password. That is what I am trying to avoid by keeping a single instance of a "base" word and then performing common mutations on that source word: toggling case, prefixing and suffixing numbers and the like. Using rules to the degree you did in your example would almost certainly duplicate many passwords, which is a speed killer, especially with WPA, and would probably also produce many words that are just random.
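
To show what I mean, here is a sketch (rule functions as documented on the hashcat wiki; file names are examples and I have not benchmarked this) of a small rule file covering the common mutations, with duplicates squeezed out before the attack:

```
cat > common.rule <<'EOF'
:
c
T0
$1
^1
EOF
# ":" keeps the base word, "c" capitalises it, "T0" toggles the first
# character, "$1" appends a 1 and "^1" prepends a 1. On an all-lowercase
# base word "c" and "T0" collide, so sort -u removes the duplicate
# candidates before the slow WPA phase.
hashcat --stdout -r common.rule base_words.txt | LC_ALL=C sort -u > candidates.txt
```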

Grinding through WPA PMKs is a slow process, as you are well aware, so if I can cut duplication to the minimum I will!

As I say, I bow to your superiority in this field without question, but do you see where I am struggling with this? Am I going way off target with my ideas?

Oh, and referring to your…

Quote:honey@gmail.com

That was very…. r'3ri3i$f$a$lsfc

If you get bored one day and feel like a bit of coding, I hope you will think about this request again. I do think there is some merit to it, especially where lines containing non-printable ASCII are concerned.

Thank you very much for at least reading this and replying.
#5
During the development of ULM, I was corresponding with Blazer about my idea of using corpora as word lists, along with ideas from Matt Weir's research.

Those "word lists" out there are essentially pre-mangled, but with programs with john-derived languages, all we need are the base words. There is a whole body of formal linguistics, tools, and free corpus sources to be found (as many cost money).
#6
Hi Kgx Pnqvhm

That's one interesting username! At first I thought it was a modular or Caesar cipher, but nothing came of it.

Anyway, thank you for your post; although very light on links, it was very interesting to me. I searched using some of the terms you mentioned and found a very good site, which I thought I would share here as I assume many would find it useful.

Large collection of words.

Unfortunately they require "spamming rights" over you, as they demand an e-mail address before you are allowed to download. Just warning you all.

I spent over a year bug-testing for Blazer, and I see we both get a special mention on the about page!

As I have mentioned before, it is a real loss to us all that Blazer has retired ULM; it is such a useful tool and nothing else comes close to it.

I couldn't find much on "john-derived languages".

Quote:There is a whole body of formal linguistics, tools

Any chance you could name a couple and help us search for them?

Thanks again for your post, Kgx Pnqvhm; you have given me an altogether new subject to learn about. I like the idea of these corpus sources, which to be honest I had never heard of before.

Thanks.
#7
"Finding and creating effective input dictionaries is a non-trivial problem."
From Matt Weir's "Using Probabilistic Techniques to aid in Password Cracking Attacks", found at his tools site under Presentations and Papers:
http://sites.google.com/site/reusablesec/
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Some URLs from my notes:
http://en.wikipedia.org/wiki/Corpus_of_American_English
http://corpus.byu.edu/coca/wordfreq.asp?s=y
http://www.wordfrequency.info/
http://googleresearch.blogspot.com/2006/...o-you.html
http://googlesystem.blogspot.com/2010/12...iewer.html
http://ngrams.googlelabs.com/datasets
http://www.english-for-students.com/Words-List.html
http://en.wikipedia.org/wiki/American_National_Corpus
http://www.americannationalcorpus.org/frequency.html
http://www.anc.org/MASC/Download.html
http://en.wikipedia.org/wiki/British_National_Corpus
http://ucrel.lancs.ac.uk/bncfreq/flists.html
http://www.kilgarriff.co.uk/bnc-readme.html
http://www.natcorp.ox.ac.uk/corpus/index...D=products
http://faculty.washington.edu/dillon/Gra...ml#wintree
http://courses.washington.edu/englhtml/e...ources.htm
http://www.ota.ox.ac.uk/catalogue/index-id.html
http://wwwm.coventry.ac.uk/researchnet/B.../BAWE.aspx
http://conc.lextutor.ca/tuples
http://pie.usna.edu/
http://web-ngram.research.microsoft.com/info/
http://en.wiktionary.org/wiki/Wiktionary...ency_lists
http://www.pitt.edu/~naraehan/ling2050/r...rpora.html
http://xaira.sourceforge.net/
http://www.oucs.ox.ac.uk/rts/xaira/
http://www.americannationalcorpus.org/xaira.html
http://www.webcorp.org.uk/guide/howworks.html
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Unlike the linguists, all we want are the base words, which may be more accessible from frequency lists; a sketch of that route follows below.
Full-blown corpora tools may be needed to extract those words from formal corpora.
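
As a rough sketch of the frequency-list route (assuming a word-then-count pair per line, which most of the lists above use; the cutoff value is arbitrary):

```
# keep words seen at least 100 times, lowercase them, dedupe
awk '$2 >= 100 { print tolower($1) }' frequency_list.txt \
  | LC_ALL=C sort -u > base_words.txt
```
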
#8
Hey, that's great! Thank you very much!

There are some good links there which I can see are going to keep me busy. I bet you have a fantastic wordlist collection!

Quote:Unlike the linguists, all we want are the base words,

Yes, I totally agree: individual words can be joined and swapped about. Although my interest now is mostly WPA, I have thought about this for some time, and practical experience has shown me that many WPA passwords are actually phrases or quotes.

I think people make more of an effort when choosing a WPA password because they don't have to type it in every day. Quite often I find relatively good-quality passwords being used for WPA. It seems to be a different world from the everyday "user log-ins".

Anyway, thank you very much for sharing those links with us; it seems you have been interested in this for some time. I hope you hang around the forum for a while!

I'm still pondering that username of yours!
#9
I don't think using "base words" is a good idea in password cracking. My experience shows that using already-cracked passwords is much more efficient, especially when they are mangled again, than using clean wordlists. The same applies to building Markov stats from them.

Take a look at the fingerprint attack. It is based on patterns, and my research shows its efficiency is unbeaten. I admit it is hard to understand, because most people do not know what a combinator engine is, or that this attack is the combination of automatically generated patterns run through a combinator engine.

If you think WPA passwords differ from website passwords, you are right. They are either unchanged, meaning generated by a random generator (default passwords, KeePass) and therefore only crackable using brute force (or a lot of luck), or they are combinations of patterns already known from the user, maybe with more mutations and additional characters. Again, take a look at the fingerprint attack.
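
The basic workflow, as described on the hashcat wiki (tool names from hashcat-utils; file names are just examples): expander breaks every already-cracked password into its short substring patterns, and combinator then glues each pattern to every other pattern.

```
# generate patterns from cracked passwords, dedupe them, then
# combine every pattern with every other pattern
./expander.bin < cracked.txt | LC_ALL=C sort -u > patterns.txt
./combinator.bin patterns.txt patterns.txt > candidates.txt
```
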
#10
Quote:Again, take a look at the fingerprint attack.

I will, thank you.