10-15-2024, 03:11 AM
(This post was last modified: 10-15-2024, 03:20 AM by bored_dude.)
Hi,
Introduction
A little contribution for anyone interested in building their own wordlist: continuously updated, lightweight, and simple as hell.
The idea was to find an (almost) limitless source of English words that is constantly updated. As the English language continually evolves, having a "tool" to pick up new words makes sense.
The generator uses Wikipedia as its source, in particular the random article page: https://en.wikipedia.org/wiki/Special:Random
Each request to this page redirects to a random article.
This could also work on Windows, though the following instructions are for Linux (they are definitely adaptable for Windows).
Getting Ready and Started
Create a new directory and move into it
Code:
mkdir scrap && cd scrap
Create the wordlist file
Code:
touch dic.txt
Create the scraping script (I use nano but any editor will work)
Code:
nano wikipedia_scrap.py
Insert the following inside
Code:
#!/usr/bin/env python3
import string
from io import BytesIO

import pycurl
from bs4 import BeautifulSoup

# Fetch a random article (Special:Random redirects to one)
buffer = BytesIO()
c = pycurl.Curl()
c.setopt(c.URL, "https://en.wikipedia.org/wiki/Special:Random")
c.setopt(c.FOLLOWLOCATION, True)
c.setopt(c.WRITEDATA, buffer)
c.perform()
c.close()
html = buffer.getvalue().decode("utf-8")

soup = BeautifulSoup(html, "html.parser")

# The <link rel="canonical"> element holds the article URL we were redirected to
rurl = soup.find("link", {"rel": "canonical"}).get("href")
# Print the article URL where the scrape will happen
print(rurl)

# Extract words from every <p> element, appending them to the wordlist
with open("dic.txt", "a") as f:
    for p in soup.find_all("p"):
        for word in p.text.split():
            # Strip surrounding punctuation and normalise to lowercase
            word = word.strip(string.punctuation).lower()
            # Keep only alphabetic words of at least 5 characters
            if word.isalpha() and len(word) > 4:
                f.write(word + "\n")
Make the script executable
Code:
chmod +x wikipedia_scrap.py
Create the bash script that will run as a loop and execute the Python script every X seconds (the timer can be changed here)
Code:
nano exec.sh
Insert the following inside
Code:
#!/bin/bash
# On exit (Ctrl+C), sort the wordlist alphabetically and remove duplicate entries
trap "sort -u dic.txt > temp && mv temp dic.txt && exit" SIGINT
while :
do
    ./wikipedia_scrap.py
    # Pause between Wikipedia requests; can be lowered, though it is unclear
    # how many requests per minute Wikipedia allows
    sleep 10
done
Make the bash script executable
Code:
chmod +x exec.sh
Now run the bash script
Code:
./exec.sh
Cleaning
Unfortunately, determining whether a word belongs to the English language is tricky. Most of the unwanted foreign words are easy to clean out, as they sort after the last English word.
For example, in my currently generated dic:
Code:
...
zygogaster
zygomatic
zygomorphic
zygomycetes
zygomycota
zygopetalinae
zygopetalon
zygopetalum
zygotaria
zygote
zymalkowski
zynetix
zysman
zytek
zytronic
zyuganov
zzzero
...
Everything after zzzero is non-English:
Code:
zábřeh
záhady
záhony
zákupy
záleský
zámok
zánka
zápolya
...
électorale
électrique
éliphas
élisabeth
élite
éloize
élèves
éléonore
émigré
émigrés
émile
émilie
énergies
épinac
épiscopale
époque
épreuves
...
διοικητής
διοικηταὶ
δρουγουβιτεία
εβίνα
εθνική
εκλογική
ελλάδα
εσφούγγιζε
ευθύμης
εὐλόγιος
εὔρωψ
θέλεις
θαυμαζω
θεσσαλιῶτις
θεῖος
θρασὺς
...
Beyond that, other non-English words can still be found sorted in between English words; this is the part I think can be improved the most.
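Part of the problem is that str.isalpha() is Unicode-aware: it happily accepts letters like "é" or "θ", which is why those words end up in dic.txt in the first place. A minimal cleaning pass could combine it with str.isascii() to keep only plain a-z words (this is a hypothetical helper, not part of the tool above):

```python
def keep_ascii_words(words):
    # str.isalpha() alone accepts "émile" or "θέλεις" (any Unicode letter),
    # so add str.isascii() to keep only plain a-z words, deduplicated and sorted
    return sorted({w for w in words if w.isascii() and w.isalpha()})
```

Run it over the generated file, e.g. keep_ascii_words(open("dic.txt").read().split()), and write the result back.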
Improving the script
The script can definitely be improved. I'm thinking of adding a regex to exclude characters found in words like "divisão" or "phước", or using a Python library that can do a better job than the .isalpha() method.
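As a sketch of the regex idea (assuming words are already lowercased, as the scraper does), something like this would reject any word containing a non-ASCII letter:

```python
import re

# Matches lowercase ASCII words of 5+ characters; rejects "divisão" or "phước"
english_like = re.compile(r"[a-z]{5,}")

def accept(word):
    return english_like.fullmatch(word) is not None
```

This single check would replace both the .isalpha() test and the length test in the scraper.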
Another way could be to change the source: instead of Wikipedia, use the API of the New York Times to scrape words from inside articles.
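For reference, a sketch of what the NYT route might look like. The endpoint and the api-key parameter are assumptions based on the NYT Article Search API; check the official docs at developer.nytimes.com before relying on them, and note that an API key is required:

```python
from urllib.parse import urlencode

# Assumed Article Search endpoint; verify against the official NYT API docs
NYT_SEARCH = "https://api.nytimes.com/svc/search/v2/articlesearch.json"

def build_search_url(query, api_key):
    # api_key is a placeholder obtained from developer.nytimes.com
    return NYT_SEARCH + "?" + urlencode({"q": query, "api-key": api_key})
```

The JSON response could then be mined for words the same way the Wikipedia script mines the <p> elements.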
I welcome any idea, suggestion, or "contribution" to make this little project better. Just keep in mind that I like to keep things as simple as they can be.
Here is a wordlist generated with this "tool", after a quick and dirty cleaning: over 137770 words generated in ~40 hours of running
https://0x0.st/X6n3.txt
Thank you !