Getting "unruly": Finding base words
#1
This is what I use to find base words in a list of plains. I am posting it both to share and to see if others have ideas for improving it.

Code:
cat plains | tr A-Z a-z | sed 's/^[^a-z]*//g; s/[^a-z]*$//g; y/112345677890@\$\!\#/ilzeasbzvbgoasih/; s/[^a-z]//g; /^$/d' >basewords

A few explanations:

First, I use tr instead of sed to convert upper to lower, both because it's much faster, and because it plays better with unicode.

I then strip out all non-alpha chars from the beginning and end of the line.

Then I do common l33t substitutions (this can probably be improved).

Then I strip out all non-lower alpha chars, and delete any empty lines.
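
For readability, the same four steps can be sketched as a commented shell function. This is only a sketch under a few assumptions, not the one-liner above verbatim: the function name `basewords` is made up, `LC_ALL=C` is added so that the `a-z` ranges mean plain ASCII regardless of locale, and the `y` list keeps a single replacement per digit (1 to l, 7 to v), since POSIX leaves the result undefined when a source character is listed twice.

```shell
#!/bin/sh
# basewords: hypothetical helper name; reads plains on stdin,
# writes base words to stdout.
basewords() {
    # Step 1: lowercase ASCII letters with tr.
    # Step 2: strip non-letters from the start and the end of each line.
    # Step 3: undo common l33t substitutions (one target per source char).
    # Step 4: drop any remaining non-letters, then delete empty lines.
    LC_ALL=C tr 'A-Z' 'a-z' |
    LC_ALL=C sed -e 's/^[^a-z]*//' \
                 -e 's/[^a-z]*$//' \
                 -e 'y/1234567890@$!#/lzeasbvbgoasih/' \
                 -e 's/[^a-z]//g' \
                 -e '/^$/d'
}

# Example run on three sample plains.
printf '%s\n' 'l33t1979' 'h4$hcaT2012' '39bananas' | basewords
```

Run on those three plains, this prints leet, hashcat and bananas. To process a file, something like `basewords < plains > basewords.txt` should do.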

Example: take the following plains

Code:
l33t1979
h4$hcaT2012
39bananas
69cockmaster69

Becomes:

Code:
leet
hashcat
bananas
cockmaster

All comments, thoughts, and flames welcome.
#2
Nice work there, epixoip!

I am very interested to see if anyone here can help improve this as it is something I am hoping to be able to do.

Unfortunately you are way ahead of me, so I don't think I can contribute much apart from occasionally bumping this thread! :)
#3
Hi epixoip

Just to let you know that your efforts on this were not in vain! :)

We have managed to inspire Blazer to add his own version of this to ULM.

He likes to do things his own way so it will be interesting to see the results.
#4
Right on :)
#5
Four years later and your command still works perfectly. Thanks for that! (I'm just in the hashcat learning process...)

Just one question about special characters, e.g. German ones:

My wordlist contains, for example, the word "könig". In German you sometimes write "oe" for "ö".
So does it make sense to add "koenig" to my list of base words as well? Or is it better to write a rule for that (if one doesn't already exist somewhere)?
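
For the first option, a minimal sketch could emit the transliterated spelling next to the original. The function name `add_translit` and the exact mapping (ä to ae, ö to oe, ü to ue, ß to ss, lowercase only) are assumptions made for illustration, and the input is assumed to be UTF-8. Note that the command from post #1 deletes everything outside a-z, so a step like this would have to run before it; otherwise "könig" would normally end up as "knig".

```shell
#!/bin/sh
# add_translit: hypothetical helper; prints each word plus its
# ASCII transliteration (assumes UTF-8 input, lowercase umlauts only).
add_translit() {
    # 'p' prints the untouched line first; the substitutions then rewrite
    # the line and sed prints the rewritten copy as well.
    # awk drops lines it has already seen (anywhere in the list),
    # keeping the original order.
    sed -e 'p' -e 's/ä/ae/g; s/ö/oe/g; s/ü/ue/g; s/ß/ss/g' |
    awk '!seen[$0]++'
}

printf '%s\n' 'könig' 'straße' 'banana' | add_translit
```

For that input the output should be könig, koenig, straße, strasse and banana, one per line.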


And what about stuff like this:


���
����

(For me these show up as question marks, I guess because of encoding problems.)
That isn't of any value for my baseword list, is it?
#6
@epixoip: just a cosmetic change, before putting output into file:

Code:
 | sort -u > basewords
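
A related variant, in case the order matters to you: instead of only deduplicating, count how often each base word occurs and list the most frequent first. The helper name `rank_basewords` is made up for this sketch, and it assumes one base word per line on stdin with no spaces inside the words.

```shell
#!/bin/sh
# rank_basewords: hypothetical helper; most frequent base words first.
rank_basewords() {
    # sort groups identical words, uniq -c prefixes each distinct word
    # with its count, sort -rn puts the highest count first, and awk
    # drops the count column again.
    sort |
    uniq -c |
    sort -rn |
    awk '{print $2}'
}

printf '%s\n' 'hashcat' 'leet' 'hashcat' | rank_basewords
```

Here hashcat occurs twice and leet once, so hashcat is listed first.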
#7
With your command you're not lowercasing things like German umlauts (Ä --> ä, Ö --> ö, etc.). But I'm not sure whether the corresponding rule (toggle) does it... I'll have to check.
#8
(08-17-2016, 08:43 PM)hashcrash Wrote: With your command you're not lowercasing things like German umlauts (Ä --> ä, Ö --> ö, etc.). But I'm not sure whether the corresponding rule (toggle) does it... I'll have to check.

Building on epixoip's code, you could do the lowercasing in 'sed' with POSIX character classes instead, so that it also understands non-English characters such as German, French, Turkish, and so on.

The code then becomes:
Code:
sed 's/[[:upper:]]*/\L&/g' infile | sed 's/^[^[:lower:]]*//g; s/[^[:lower:]]*$//g; y/112345677890@\$\!\#/ilzeasbzvbgoasih/; s/[^[:lower:]]//g; /^$/d' >outfile