a DIY wordlist generator - Printable Version

a DIY wordlist generator - Printable Version

+- hashcat Forum (https://hashcat.net/forum)
+-- Forum: Misc (https://hashcat.net/forum/forum-15.html)
+--- Forum: User Contributions (https://hashcat.net/forum/forum-25.html)
+--- Thread: a DIY wordlist generator (/thread-12189.html)

a DIY wordlist generator - bored_dude - 10-15-2024

Hi,

Introduction

A little contributation to whom may be interested to make their own wordlist, with continuous update, lightweight and simple as hell

The idea behind it was to find a (almost) limitless source of english words with constant update. As the english language continually evolve, having a "tool" to generate new words make sense.

The generator will use as source wikipedia, in particular the random article : https://en.wikipedia.org/wiki/Special:Random

This page on trigger, will redirect to a random article.

This could work on Windows, though the following instruction are for Linux (but definitively adaptable for Windows)

Getting Ready and Started

Create a new directory and go into

Code:
mkdir scrap && cd scrap

Create the wordlist file

Code:
touch dic.txt

Create the scrapping script (I use nano but any editor will work)

Code:
nano wikipedia_scrap.py

Insert inside

Code:
#!/usr/bin/env python3

import pycurl

from io import BytesIO

from bs4 import BeautifulSoup

buffer = BytesIO()

c = pycurl.Curl()

c.setopt(c.URL, "https://en.wikipedia.org/wiki/Special:Random")

c.setopt(c.FOLLOWLOCATION, True)

c.setopt(c.WRITEDATA, buffer)

c.perform()

c.close()

html = buffer.getvalue().decode("utf-8")

# GET HTML

soup = BeautifulSoup(html, "html.parser")

# GET URL from <link rel="canonical"

rurl = soup.find('link', {'rel' : 'canonical'}).get("href")

# Print the current (redirected url) where the scrap will happen

print(rurl)

# GET all <p> from soup var

ptext = soup.find_all('p')

# Extract text from <p>

for p in ptext:

    arr = p.text.split()

    for words in arr:

        # Need more work, remove punctuation

        words = words.replace(',','').replace('.','').replace('(','').replace(')','').lower()

        if words.isalpha():

            # Only words that have at least 5 chars

            if len(words) > 4:

                f = open("dic.txt","a")

                f.write(words + '\n')

                f.close()

Make the script executable

Code:
chmod +x wikipedia_scrap.py

Create the bash script that will act as a process and execute the python script every X seconds (timer can be change here)

Code:
nano exec.sh

Insert inside

Code:
#!/bin/bash

# On exit script (ctrl+c or kill), sort alphabetically and clean any double (or triple...) entries

trap "sort -u dic.txt > temp && mv temp dic.txt && exit" SIGINT

while :

do

  ./wikipedia_scrap.py

  # Set a pause between wikipedia request, can be change for lower value (unsure how many request per minute wikipedia will allow)

  sleep 10

done

Make the bash script executable

Code:
chmod +x exec.sh

Now run the bash script

Code:
./exec.sh

Cleaning

Unfortunatelly, determine if a word is in english language is tricky, most of the unwanted foreign words will be easily cleaned as they will be sorted after the last english word.

For example on my current generated dic

Code:
...

zygogaster

zygomatic

zygomorphic

zygomycetes

zygomycota

zygopetalinae

zygopetalon

zygopetalum

zygotaria

zygote

zymalkowski

zynetix

zysman

zytek

zytronic

zyuganov

zzzero

...

Everything after zzzero

Code:
zábřeh

záhady

záhony

zákupy

záleský

zámok

zánka

zápolya

...

électorale

électrique

éliphas

élisabeth

élite

éloize

élèves

éléonore

émigré

émigrés

émile

émilie

énergies

épinac

épiscopale

époque

épreuves

...

διοικητής

διοικηταὶ

δρουγουβιτεία

εβίνα

εθνική

εκλογική

ελλάδα

εσφούγγιζε

ευθύμης

εὐλόγιος

εὔρωψ

θέλεις

θαυμαζω

θεσσαλιῶτις

θεῖος

θρασὺς

...

Starting from there, other non english words can be found sorted between english word, this is the part I think can be improved.

Improving the script

The script can definitively be improved, I'm thinking adding a regex to exclude characters found in words like "divisão" or "phước", or using python library that can do a better job than the function .is_alpha()

Another way could be to change the source, instead of wikipedia, using the API of the New york time to scrap word inside article

I welcome any idea, suggestion or "contribution" to make this little project better. Just keep in mind I like to keep thing as simple as they can be

Here a generated wordlist made with this "tool", with a dirty fast cleaning, over 137770 words generated in ~40 hours of run

https://0x0.st/X6n3.txt

Thank you !

RE: a DIY wordlist generator - RealEnder - 10-18-2024

Wow, please don't do it this way. No need to hammer Wikipedia's site - it's slow and non-productive.
Wikipedia has dumps here: https://dumps.wikimedia.org . Just parse them and extract the words. The nice thing here is you can also pull article versions and catch mistypings and other interesting stuff.
Back in 2011 I wrote a similar script, which parses the dump on the fly and puts the wordlist in sqlite DB. It's python2 and written for targeted case, but can be easily changed to work. Check it out here: https://sec.stanev.org/?download
If there is an interest, I can shape it a bit.

RE: a DIY wordlist generator - DanielG - 10-18-2024

Agreed, this is also created at https://github.com/NorthwaveSecurity/wikiraider

"WikiRaider enables you to generate wordlists based on country specific databases of Wikipedia. This will provide you with not only a list of words in a specific language, it will also provide you with e.g. country specific artists, TV shows, places, etc."

RE: a DIY wordlist generator - bored_dude - 10-18-2024

Thanks guys, very useful infos !

I will definitively remake the script to use https://dumps.wikimedia.org instead and take at the code of wikiraider