a DIY wordlist generator - Printable Version

+- hashcat Forum (https://hashcat.net/forum)
+-- Forum: Misc (https://hashcat.net/forum/forum-15.html)
+--- Forum: User Contributions (https://hashcat.net/forum/forum-25.html)
+--- Thread: a DIY wordlist generator (/thread-12189.html)
a DIY wordlist generator - bored_dude - 10-15-2024

Hi,

Introduction

A little contribution for anyone interested in making their own wordlist: continuously updated, lightweight, and simple as hell. The idea behind it was to find an (almost) limitless source of English words with constant updates. As the English language continually evolves, having a "tool" to generate new words makes sense.

The generator uses Wikipedia as its source, in particular the random article page: https://en.wikipedia.org/wiki/Special:Random
Each visit to this page redirects to a random article.

This could work on Windows too, though the following instructions are for Linux (but definitely adaptable to Windows).

Getting Ready and Started

Create a new directory and go into it
Code:
mkdir scrap && cd scrap

Create the wordlist file
Code:
touch dic.txt

Create the scraping script (I use nano but any editor will work)
Code:
nano wikipedia_scrap.py

Insert inside
Code:
#!/usr/bin/env python3

Make the script executable
Code:
chmod +x wikipedia_scrap.py

Create the bash script that will act as a supervisor process and execute the Python script every X seconds (the timer can be changed there)
Code:
nano exec.sh

Insert inside
Code:
#!/bin/bash

Make the bash script executable
Code:
chmod +x exec.sh

Now run the bash script
Code:
./exec.sh

Cleaning

Unfortunately, determining whether a word is English is tricky. Most of the unwanted foreign words are easily cleaned, as they sort after the last English word. For example, in my currently generated dic
Code:
...

everything after zzzero, e.g.
Code:
zábřeh

is foreign. Starting from there, other non-English words can still be found sorted between English words; this is the part I think can be improved.
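The scraper described above could be sketched roughly as follows. This is a minimal guess at what such a wikipedia_scrap.py might look like, assuming a urllib fetch of Special:Random and a simple ASCII-letter filter; the names extract_words, RANDOM_URL, and the minimum word length are my assumptions, not the thread's original script:

```python
#!/usr/bin/env python3
"""Hypothetical sketch: fetch one random Wikipedia article and append
any new alphabetic words to dic.txt. Not the original thread's script,
just one way the described steps could be implemented."""
import re
import urllib.request

RANDOM_URL = "https://en.wikipedia.org/wiki/Special:Random"
DIC = "dic.txt"


def extract_words(html: str) -> set[str]:
    """Lowercase ASCII-only words of 3+ letters, an isalpha()-style filter."""
    return {w.lower() for w in re.findall(r"[A-Za-z]+", html) if len(w) > 2}


def main() -> None:
    # Special:Random issues a redirect to a random article; urlopen follows it.
    req = urllib.request.Request(RANDOM_URL, headers={"User-Agent": "wordlist-diy"})
    with urllib.request.urlopen(req) as resp:
        html = resp.read().decode("utf-8", errors="replace")

    # Load what we already have so dic.txt only grows with new words.
    try:
        with open(DIC) as f:
            known = set(f.read().split())
    except FileNotFoundError:
        known = set()

    with open(DIC, "a") as f:
        for w in sorted(extract_words(html) - known):
            f.write(w + "\n")


# exec.sh would invoke this script repeatedly, calling main() once per run.
```

The ASCII-only regex is stricter than .isalpha() (which accepts accented letters), so it already drops many of the foreign words discussed in the Cleaning section.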
Improving the script

The script can definitely be improved. I'm thinking of adding a regex to exclude characters found in words like "divisão" or "phước", or using a Python library that does a better job than .isalpha(). Another option would be to change the source: instead of Wikipedia, use the New York Times API and scrape words from its articles.

I welcome any idea, suggestion or "contribution" to make this little project better. Just keep in mind that I like to keep things as simple as they can be.

Here is a wordlist generated with this "tool", after a quick-and-dirty cleaning: over 137770 words generated in ~40 hours of running
https://0x0.st/X6n3.txt

Thank you!

RE: a DIY wordlist generator - RealEnder - 10-18-2024

Wow, please don't do it this way. There is no need to hammer Wikipedia's site - it's slow and non-productive. Wikipedia has dumps here: https://dumps.wikimedia.org . Just parse them and extract the words. The nice thing here is you can also pull article versions and catch mistypings and other interesting stuff.
Back in 2011 I wrote a similar script, which parses the dump on the fly and puts the wordlist in an sqlite DB. It's python2 and written for a targeted case, but it can easily be changed to work. Check it out here: https://sec.stanev.org/?download
If there is interest, I can shape it up a bit.

RE: a DIY wordlist generator - DanielG - 10-18-2024

Agreed; something similar already exists at https://github.com/NorthwaveSecurity/wikiraider
"WikiRaider enables you to generate wordlists based on country specific databases of Wikipedia. This will provide you with not only a list of words in a specific language, it will also provide you with e.g. country specific artists, TV shows, places, etc."

RE: a DIY wordlist generator - bored_dude - 10-18-2024

Thanks guys, very useful info! I will definitely remake the script to use https://dumps.wikimedia.org instead and take a look at the code of wikiraider.