hashcat Forum
a DIY wordlist generator - Printable Version

+- hashcat Forum (https://hashcat.net/forum)
+-- Forum: Misc (https://hashcat.net/forum/forum-15.html)
+--- Forum: User Contributions (https://hashcat.net/forum/forum-25.html)
+--- Thread: a DIY wordlist generator (/thread-12189.html)



a DIY wordlist generator - bored_dude - 10-15-2024

Hi,

Introduction

A little contribution for anyone who may be interested in making their own wordlist: continuously updated, lightweight, and simple as hell.

The idea behind it was to find an (almost) limitless source of English words that is constantly updated. As the English language continually evolves, having a "tool" to generate new words makes sense.

The generator will use Wikipedia as its source, in particular the random article page: https://en.wikipedia.org/wiki/Special:Random

When requested, this page redirects to a random article.

This could work on Windows too, though the following instructions are for Linux (but they are definitely adaptable for Windows).

Getting Ready and Started

Create a new directory and go into it

Code:
mkdir scrap && cd scrap

Create the wordlist file

Code:
touch dic.txt
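
The scraping script below relies on pycurl and BeautifulSoup. If they are not already installed, something like this should do it (assuming pip is available):

Code:
pip install pycurl beautifulsoup4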

Create the scraping script (I use nano but any editor will work)

Code:
nano wikipedia_scrap.py

Insert inside

Code:
#!/usr/bin/env python3
import pycurl
from io import BytesIO
from bs4 import BeautifulSoup


buffer = BytesIO()
c = pycurl.Curl()
c.setopt(c.URL, "https://en.wikipedia.org/wiki/Special:Random")
c.setopt(c.FOLLOWLOCATION, True)
c.setopt(c.WRITEDATA, buffer)
c.perform()
c.close()
html = buffer.getvalue().decode("utf-8")

# Parse the HTML
soup = BeautifulSoup(html, "html.parser")

# Get the article URL from <link rel="canonical">
rurl = soup.find('link', {'rel': 'canonical'}).get("href")

# Print the (redirected) URL of the article being scraped
print(rurl)

# Get all <p> tags and append qualifying words to the wordlist
ptext = soup.find_all('p')
with open("dic.txt", "a") as f:
    for p in ptext:
        for word in p.text.split():
            # Strip some punctuation and lowercase (still needs more work)
            word = word.replace(',', '').replace('.', '').replace('(', '').replace(')', '').lower()
            # Keep only alphabetic words of at least 5 characters
            if word.isalpha() and len(word) > 4:
                f.write(word + '\n')

Make the script executable

Code:
chmod +x wikipedia_scrap.py

Create the bash script that will run in a loop and execute the Python script every X seconds (the timer can be changed there)

Code:
nano exec.sh

Insert inside

Code:
#!/bin/bash

# On exit (Ctrl+C or kill), sort alphabetically and remove any duplicate entries
trap "sort -u dic.txt > temp && mv temp dic.txt && exit" SIGINT SIGTERM

while :
do
  ./wikipedia_scrap.py
  # Pause between Wikipedia requests; can be lowered (unsure how many requests per minute Wikipedia will allow)
  sleep 10
done

Make the bash script executable

Code:
chmod +x exec.sh

Now run the bash script

Code:
./exec.sh

Cleaning

Unfortunately, determining whether a word belongs to the English language is tricky. Most of the unwanted foreign words are easy to clean though, since they end up sorted after the last English word (see the sketch after the listings below).

For example, in the dic.txt generated on my machine:

Code:
...
zygogaster
zygomatic
zygomorphic
zygomycetes
zygomycota
zygopetalinae
zygopetalon
zygopetalum
zygotaria
zygote
zymalkowski
zynetix
zysman
zytek
zytronic
zyuganov
zzzero
...

Everything after zzzero

Code:
zábřeh
záhady
záhony
zákupy
záleský
zámok
zánka
zápolya
...
électorale
électrique
éliphas
élisabeth
élite
éloize
élèves
éléonore
émigré
émigrés
émile
émilie
énergies
épinac
épiscopale
époque
épreuves
...
διοικητής
διοικηταὶ
δρουγουβιτεία
εβίνα
εθνική
εκλογική
ελλάδα
εσφούγγιζε
ευθύμης
εὐλόγιος
εὔρωψ
θέλεις
θαυμαζω
θεσσαλιῶτις
θεῖος
θρασὺς
...

Beyond that point, other non-English words can be found sorted in between English words; this is the part I think can be improved.
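
One quick and dirty way to handle both the tail and the in-between cases (just a sketch, the output file name is arbitrary) is to keep only the entries made of plain ASCII letters:

Code:
#!/usr/bin/env python3
# Rough cleanup sketch: keep only words made of plain ASCII letters,
# which drops the accented and Greek entries shown above
with open("dic.txt") as src, open("dic_clean.txt", "w") as dst:
    for line in src:
        word = line.strip()
        if word.isascii() and word.isalpha():
            dst.write(word + "\n")

(str.isascii() needs Python 3.7+; of course this also throws away legitimate words that contain accented characters.)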

Improving the script

The script itself can definitely be improved: I'm thinking of adding a regex to exclude characters found in words like "divisão" or "phước", or using a Python library that can do a better job than .isalpha()
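
For example, a minimal version of that regex idea (just a sketch, not battle-tested) could replace the .isalpha()/length check in the scraper with a pattern that only accepts plain ASCII letters of 5+ characters:

Code:
#!/usr/bin/env python3
# Sketch: a stricter word check using a regex instead of .isalpha()
import re

word_re = re.compile(r"[a-z]{5,}")

def keep(word):
    return word_re.fullmatch(word) is not None

# Words taken from the listings above: only the first one should pass
for sample in ("zygote", "divisão", "phước", "élite", "διοικητής"):
    print(sample, keep(sample))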

Another way could be to change the source: instead of Wikipedia, use the New York Times API and scrape words from its articles.

I welcome any idea, suggestion or "contribution" to make this little project better. Just keep in mind that I like to keep things as simple as they can be.

Here is a wordlist generated with this "tool" after a quick and dirty cleaning: over 137770 words generated in ~40 hours of running.

https://0x0.st/X6n3.txt

Thank you !


RE: a DIY wordlist generator - RealEnder - 10-18-2024

Wow, please don't do it this way. No need to hammer Wikipedia's site - it's slow and non-productive.
Wikipedia has dumps here: https://dumps.wikimedia.org . Just parse them and extract the words. The nice thing here is you can also pull article versions and catch mistypings and other interesting stuff.
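
Something along these lines (a rough, untested sketch, the dump file name is just an example) is enough to stream words out of a compressed dump without unpacking it first:

Code:
#!/usr/bin/env python3
# Rough sketch: stream a compressed Wikipedia dump and print candidate words.
# No real XML/wikitext parsing here, just a regex over the raw lines, so it will
# also pick up markup keywords; pipe the output through sort -u afterwards.
import bz2
import re

word_re = re.compile(r"[a-z]{5,}")

with bz2.open("enwiki-latest-pages-articles.xml.bz2", "rt", encoding="utf-8") as dump:
    for line in dump:
        for word in word_re.findall(line.lower()):
            print(word)

A proper parser would do a much cleaner job, but even that gets you a huge wordlist from a single dump file.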
Back in 2011 I wrote a similar script, which parses the dump on the fly and puts the wordlist in an sqlite DB. It's Python 2 and was written for a targeted case, but it can easily be changed to work. Check it out here: https://sec.stanev.org/?download
If there is an interest, I can shape it a bit.


RE: a DIY wordlist generator - DanielG - 10-18-2024

Agreed, something like this has also been created at https://github.com/NorthwaveSecurity/wikiraider

"WikiRaider enables you to generate wordlists based on country specific databases of Wikipedia. This will provide you with not only a list of words in a specific language, it will also provide you with e.g. country specific artists, TV shows, places, etc."


RE: a DIY wordlist generator - bored_dude - 10-18-2024

Thanks guys, very useful info!

I will definitely remake the script to use https://dumps.wikimedia.org instead and take a look at the code of wikiraider.