Extracting the passwords from a multi-file wordlist (sed & grep).
Hello everybody. Let's say you "hypothetically" encounter a wordlist that is not only split across many files, but ("hypothetically") also contains a lot of other information about the users (let's say it's a "hypothetical" leak). As a good guy, you don't need all that info; actually, you don't want that info. It would have been great if the list had already been cleaned up and the passwords extracted.

I decided to learn some grep & sed, so this seemed like a great way to get started with those tools. Here's how you could extract all the passwords and clean up the file.

Extract all the files into a directory. You will have to identify something unique about the lines containing the passwords you want to extract. Let's say a line looks something like:

comment=dont share lists containing user emails!

and this goes on forever. Here we'll extract every line containing "pass=" from every file in the folder "extracted_directory" using grep. Note that you need to run this one level 'up' from that directory in bash (or your terminal).
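For illustration, here's a tiny stand-in for such a dump (the user=/pass= field names are made up, modeled on the comment= line above):

```shell
# build a one-file stand-in for the dump (field names are hypothetical)
mkdir -p extracted_directory
cat <<'EOF' > extracted_directory/sample.txt
user=alice@example.com
pass=hunter2
comment=dont share lists containing user emails!
EOF

# the extraction step described below, applied to the stand-in
grep -rhi 'pass=' extracted_directory/
# -> pass=hunter2
```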

grep -rhi 'pass=' extracted_directory/ > wordlist_merged.txt

Isn't grep awesome :) Plus you are being legit and not looking at personal stuff. Now we have to remove "pass=" from the beginning of every line of that file. We can use sed:

sed 's/pass=//g' wordlist_merged.txt > wordlist_cleaned.txt

Let's remove leading and trailing whitespace:

cat wordlist_cleaned.txt | sed 's/^[ ]*//;s/[ ]*$//' > wordlist_whatever.txt

Now you can remove duplicates and sort the file, starting with the most-used password:

cat wordlist_whatever.txt | sort | uniq -c | sort -nr > wordlist_sorted.txt

This results in a list with a count in front of every password. We can remove those counts using sed:

cat wordlist_sorted.txt | sed 's/^[ ]*[1234567890]*[ ]//' > wordlist_FINAL.txt

And there you go! A nicely sorted and cleaned list :)

Alert: Always back up your lists before doing any of this!
Have a great day.
You can combine your first four commands into a single sed command. Since your example uses multiple nested directories (grep -r), you can use find + sed. And since you used grep -i, I'll assume that pass= could be in any case.

find -type f | xargs sed -rn 's/^[Pp][Aa][Ss][Ss]=(.*)$/\1/p' | sort -u >wordlist_whatever.txt

Sorting the wordlist by most-used password first isn't a bad idea, I suppose; you could still work that in:

find -type f | xargs sed -rn 's/^[Pp][Aa][Ss][Ss]=(.*)$/\1/p' | sort | uniq -c | sort -bnrk 1 | sed -r 's/\W+[0-9]+\W+//' >wordlist_ordered_whatever.txt
type *.* | sed '/pass=/I!d;s/pass=\(.*$\)/\1/I;s/^[ \t]*//;s/[ \t]*$//' | sort | uniq -c | sort -nr | cut -c9- > wordlist_FINAL.txt
Wow! I'm just starting with sed & grep. I'm no bash voodoo master, but I try to get around. So if I get this straight, you're finding every regular file in a directory, then using sed you're identifying the correct lines and replacing each with a backreference \1. I'm reading up on back-references but it isn't clear to me yet. Also, '{}' tells sed what file we're working on, and I don't get the last \...

While sorting, I wouldn't use -k1 since the output file will contain some whitespace in front of the numbers: uniq -c right-aligns its counts with leading spaces.

Hope this is kinda clear :)

I used -i since I didn't want to miss anything (just in case), and -r because there are many files that I'm extracting the passwords from. I always underestimate the power of piping, and always forget you can end a bash line with ;

Again, thank you very much for that, awesome stuff.

Edit: Woah! Somebody posted while I was writing this. Loving these forums :)

(06-14-2012, 07:35 PM)M@LIK Wrote: Optimization?
type *.* | sed '/pass=/I!d;s/pass=\(.*$\)/\1/I;s/^[ \t]*//;s/[ \t]*$//' | sort | uniq -c | sort -nr | cut -c9- > wordlist_FINAL.txt

On OS X, \t isn't supported; you have to actually press Tab. It uses an older version of sed... Also, cut -c9- won't really work since the counts have different numbers of digits. You can pipe it back into
sed 's/^[ ]*[1234567890]*[ ]//'

and it should be ok. Or am I missing something ?
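For what it's worth, a POSIX character class seems to sidestep the \t problem, since both GNU and BSD sed understand [[:space:]] (a sketch):

```shell
# strip leading and trailing whitespace (spaces and tabs) portably
printf '  \thello \t \n' | sed 's/^[[:space:]]*//; s/[[:space:]]*$//'
# -> hello
```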
M@LIK uses Windows, so his command will not work for you. For instance, 'type' is like 'cat' on Windows, whereas in most unix shells 'type' shows how a command will be interpreted by the shell. And you're right, the 'cut -c9-' probably isn't portable. Although sed -r 's/\W+[0-9]+\W+//' is a lot cleaner than sed 's/^[ ]*[1234567890]*[ ]//', if bsd sed supports that syntax.

Speaking of sed, your sed isn't different because it's older, your sed is different because it's bsd sed and not gnu sed. As you learn how to use more console utils, you'll find that the bsd variants of pretty much everything do not have even half of the options their gnu counterparts have. But, now knowing that you're on OS X, my commands probably won't work for you either.

To answer your questions,

Quote:So if I get this straight you're finding every regular file in a directory

No, in a directory and all its subdirectories. If all of your files are in one directory then you wouldn't need to use 'find' in my example; I only did that because you used grep -r. In your original example, if all of your files are in one directory then you don't need to do a recursive grep, you would just do grep *. And the same goes for M@LIK as well; he wouldn't need to do 'type *.* | sed', he could just do 'sed *.*' or whatever. It's the Windows equivalent of UUoC.

Quote:using sed you're identifying the correct lines and replacing each with a backreference \1.

Correct. We match only the portion of the line we care about, i.e. everything after pass=. Then we print just the match that we care about. \1 is the first grouping, \2 the second, so on and so forth.
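A tiny standalone illustration of groupings, using a made-up input line:

```shell
# (...) captures a group; \1 and \2 refer back to the captures in the replacement
echo 'user=bob pass=hunter2' | sed -r 's/user=([^ ]+) pass=([^ ]+)/\1:\2/'
# -> bob:hunter2
```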

Quote:'{}' tells sed what file we're working on and I don't get the last \...

That's all part of the ''find'' command. I later modified the command to use xargs instead which is much faster when dealing with large groups of files. What I had before with using ''find -exec'' may not apply on bsd find, I can't remember.
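To unpack the two forms being described, here's a toy run (the demo directory and file are made up):

```shell
# set up a throwaway directory with one file to scan
mkdir -p demo
printf 'pass=s3cret\n' > demo/a.txt

# -exec runs sed once per file; {} stands for the filename and \; ends the -exec clause
find demo -type f -exec sed -n 's/^pass=//p' {} \;
# -> s3cret

# xargs batches many filenames into far fewer sed invocations, faster on huge trees
find demo -type f -print0 | xargs -0 sed -n 's/^pass=//p'
# -> s3cret
```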

Quote:While sorting, I wouldn't use -k1 since the output file will contain some white space in front of the numbers.

-b tells sort to ignore leading blanks in the key, so the whitespace doesn't matter

So -- if all your files are in one directory and you don't actually need to recursively search, and you're pretty sure that 'pass=' will always be lower case, we can clean up my command a bit. I have no idea if it will work with bsd sed, but this will probably help out some gnu users.

sed -rn 's/^pass=(.*)$/\1/p' * | sort | uniq -c | sort -bnrk 1 | sed -r 's/\W+[0-9]+\W+//'

Happy learning :)
Agreed. Thanks for the input.
Thanks for going through everything, this is awesome. I love me some command learning, so powerful :D So I shouldn't be using -r; I misunderstood that parameter (I actually don't need grep). It will definitely make everything faster.

Usually, when using gnu commands (from the net), I just try them and slowly work through the errors I get. Most of the time it ends up working. If all fails, my Linux VM is always ready to boot :)

Your help is very appreciated.
This thread is brilliant, just what I was hoping for in this section. Thanks!

Here's one I have been working on for a while, if the regex gurus would care to comment on it?

I have often wanted to filter the purely random passwords out of a password list. I understand they could be genuine passwords, but I believe they are less likely.

So... suppose I have this list...


I really only want to keep..


So this requires some sort of logic filter. At first I made a regex like this...

The code above was meant to remove anything that didn't contain a vowel. This worked surprisingly well, but it has its limitations.
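The regex itself didn't survive the post; a guess at what a keep-only-words-with-a-vowel filter could look like (the sample list and filename here are made up):

```shell
# made-up sample list: one real word, one random string, one leet word
printf 'monkey\nsJHGJs7\nm00n\n' > sample_list.txt

# keep only lines containing at least one vowel (case-insensitive)
grep -i '[aeiou]' sample_list.txt
# -> monkey
```

Note that "m00n" gets dropped here, which is exactly the limitation described below.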

If used on the list above, I would lose "m00n" and "c0d3w0rd", both good password candidates.

The only thing I can think of is to use something like an office spellchecker, which tries to match any given text as closely as possible to a real word. Then perhaps add some common "leet" substitutions to catch modified passwords such as "m00n".

Anyone think of anything smarter than this ?

Thank you.
random is easy to generate but hard to detect. humans recognize what appears to be random by identifying that there are no recognizable patterns in the text. programmatically speaking, you have to do the same.

one way you could approach this is kind of like the spell checker approach you mentioned -- you could check each word in your wordlist against a dictionary and only print matches. you'd need to do case insensitive matches and do basic "l33t" substitutions, etc.

a VERY simple example that is guaranteed to have false negatives would be something like this:

while read plain; do
    baseword="$(echo "$plain" | sed -r 'y/4310/aeio/; s/[^a-zA-Z]//g; s/^.(.*)./\1/' | tr A-Z a-z)"

    if grep "$baseword" words.english.txt >/dev/null; then
        echo "$plain"
    else
        echo "'$plain' appears to be random: couldn't find '$baseword' in the dictionary." >&2
    fi
done < sample.plains

this is a mediocre implementation at best, of course. the biggest problem with this code is it doesn't do any sort of detection of made-up compound words (like 'applebanana' for example), but the spellcheck approach would have this problem, too.

here's an example. using these plains:


this script outputs the following:

'applebanana' appears to be random: couldn't find 'pplebanan' in the dictionary.
'8dJ3na3Ldn4' appears to be random: couldn't find 'jenaeldn' in the dictionary.
'tac0v4gina5' appears to be random: couldn't find 'acovagin' in the dictionary.
'aN7b3mlK' appears to be random: couldn't find 'nbeml' in the dictionary.

so, you know, not too bad. but far from good. this might give you something to build upon though.
Hi epixoip

That's very clever of you, thank you for posting! Also thank you for brightening up my day by making me laugh with your choice of example words! cockmaster! :D

That is a very clever approach; I notice that you are chopping the beginning and end from words to try to remove the "password padding".

This can be done with hashcat rules [] [[]] [[[]]] etc. You could keep chopping bits off until you find a word match, I suppose. This way applebanana would eventually be found via either apple or banana. Very clever, epixoip.
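The keep-chopping idea can be sketched in shell like this (the dictionary file and its contents are made up):

```shell
# tiny stand-in dictionary
printf 'apple\nbanana\nmoon\n' > dict.txt

word='applebanana'
while [ -n "$word" ]; do
    if grep -qx "$word" dict.txt; then
        echo "found: $word"
        break
    fi
    word="${word%?}"   # chop the last character, like hashcat's ] rule
done
# -> found: apple
```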

At first when you posted I didn't see how I would save any new words that were not in the original dictionary, but now I do since you posted the demo.

I wonder if this could be optimised further with some common rules, such as "u always follows q" and "i before e except after c", etc.

Thank you very much for your interest in this.