How does hashcat deal with a list of hashes?
#1
Question 
At the end of the day, all hashcat does is just this (in Python):

Code:
import hashlib

list_of_hash = []   # the target MD5 digests (hex strings), loaded from somewhere
with open("wordlist.txt", "r") as f:
    for line in f:
        calculate = hashlib.md5(line.encode()).hexdigest()
        for h in list_of_hash:
            if h == calculate:
                print(f"{h}:{line}")

My question is about this list_of_hash. Let's say I have 100k unsalted MD5 hashes that I want to crack. Which is the better option for the program:
  • Load all hashes into memory
  • Load the hashes in chunks. In this mode, it runs every line of the wordlist against the first 10k hashes, then loads another 10k and reruns the process (see the rough sketch below)
At what point does it become too slow to load all the hashes into memory? Should I want the program to load all the hashes, or can that make it slower? What chunk size is best? Where can I configure it?
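
To make the chunked option concrete, something like this is what I have in mind (just a rough sketch; the 10k chunk size, the hashes.txt file name and the load_hash_chunks helper are made up for illustration):

Code:
import hashlib

CHUNK_SIZE = 10_000   # made-up chunk size

def load_hash_chunks(path, chunk_size=CHUNK_SIZE):
    # yield the target hashes chunk_size lines at a time
    chunk = []
    with open(path, "r") as f:
        for line in f:
            chunk.append(line.strip())
            if len(chunk) == chunk_size:
                yield chunk
                chunk = []
    if chunk:
        yield chunk

for hash_chunk in load_hash_chunks("hashes.txt"):
    hash_chunk = set(hash_chunk)
    # rerun the whole wordlist against this chunk of hashes
    with open("wordlist.txt", "r") as f:
        for line in f:
            word = line.strip()
            digest = hashlib.md5(word.encode()).hexdigest()
            if digest in hash_chunk:
                print(f"{digest}:{word}")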

PS: All these questions are just about fast unsalted hashes.
#2
Hi rodrigo
If hashcat can load password lists of 2bil+ lines fine for me, then I assume that if you save all the 100k hashes as a .txt and rent a rig with a good amount of memory, you shouldn't have any problems loading them in? I think it depends more on your computer's memory than on the size of the hash list. Do you get an out-of-memory error when trying to load?
#3
(07-20-2023, 03:38 AM)rodrigo.Brasil Wrote: At the end of the day, all hashcat does is just this (in Python):

Code:
import hashlib

list_of_hash = []   # the target MD5 digests (hex strings), loaded from somewhere
with open("wordlist.txt", "r") as f:
    for line in f:
        calculate = hashlib.md5(line.encode()).hexdigest()
        for h in list_of_hash:
            if h == calculate:
                print(f"{h}:{line}")

This is NOT a good representation of how hashcat works and will give you a poor understanding of the parallel processing that takes place. Internally, hashcat executes many, many parallel threads to do work; it doesn't just iterate over things so simply. To do comparisons and lookups against a list of hashes, hashcat processes the input hashes into a "bitmap" or "bloom filter" and uses that for searches. We don't just load all the hashes into memory; that would be both slow and wasteful.
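
To make the idea concrete, here is a toy Python sketch (this is not hashcat's actual code; the bitmap size, the way bits are derived from the digest, and the example hash are all made up for illustration): the target digests set a few bits in a small bitmap, every candidate digest is screened against that bitmap first, and only the rare hits go on to an exact lookup.

Code:
import hashlib

BITMAP_BITS = 1 << 20   # made-up size; a real tool would tune this

def bit_positions(digest_hex):
    # derive two bit positions from the digest itself (illustrative choice)
    value = int(digest_hex, 16)
    return (value & (BITMAP_BITS - 1),
            (value >> 24) & (BITMAP_BITS - 1))

# the target hashes we want to crack; md5("a") used as an example
target_hashes = {"0cc175b9c0f1b6a831c399e269772661"}

# set the bits for every target hash
bitmap = bytearray(BITMAP_BITS // 8)
for h in target_hashes:
    for pos in bit_positions(h):
        bitmap[pos >> 3] |= 1 << (pos & 7)

def maybe_in_list(digest_hex):
    # cheap pre-filter: if any bit is clear, the digest is definitely not a target
    return all(bitmap[pos >> 3] & (1 << (pos & 7))
               for pos in bit_positions(digest_hex))

for word in ["a", "b", "password"]:
    digest = hashlib.md5(word.encode()).hexdigest()
    # only candidates that pass the bitmap are checked against the exact set
    if maybe_in_list(digest) and digest in target_hashes:
        print(f"{digest}:{word}")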
#4
first: your double for loop results in unnecessary overhead, because you run each password against nearly your whole list_of_hash (until you find your hash); see the end of the post for a better Pythonic approach (there are still better ways, but I'll keep it simple)

second: let's have some fun with Python

Code:
import hashlib
import tracemalloc

def md5_100k():
    result = []
    for i in range(1,100001):
        md5_hash = hashlib.md5(str(i).encode()).hexdigest()
        result.append(md5_hash)
    return result

tracemalloc.start()
result = md5_100k()
print(tracemalloc.get_traced_memory())
tracemalloc.stop()

The output is (8906110, 8906274), so the whole program (with a 100k MD5 hash list in memory) consumes around 9 megabytes in total, which is, yeah, cute, isn't it? So yeah, load your hashes right away into RAM.

third: drop the second loop, and don't forget to strip the newline from the input (or you will generate and check wrong MD5s)

so in total, and depending on the size of your wordlist (I use readlines to also read the input file completely into memory, and I added a modulo counter to show how "slow" this single-threaded program is), it took around ~50 seconds to hash and compare 10k real inputs.

Code:
import hashlib

def md5_100k():
    result = []
    for i in range(1,100001):
        md5_hash = hashlib.md5(str(i).encode()).hexdigest()
        result.append(md5_hash)
    return result

list_md5 = md5_100k()

with open("wordlist.txt","r") as f:
    file_content = f.readlines()
    counter = 1
    for line in file_content:
        if counter % 100 == 0:
            print(counter)
        line = line.strip()
        calculate = hashlib.md5(line.encode()).hexdigest()
        if calculate in list_md5:
            print(f"{calculate}:{line}")
        counter += 1
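
And one of those "better ways" mentioned at the top of this post, just as a sketch on the same setup: keep the generated hashes in a set instead of a list, so each "in" check is an average O(1) hash lookup instead of a scan over up to 100k list entries.

Code:
import hashlib

# same generated hashes as above, but stored in a set for O(1) average membership tests
set_md5 = {hashlib.md5(str(i).encode()).hexdigest() for i in range(1, 100001)}

with open("wordlist.txt", "r") as f:
    for line in f:
        word = line.strip()
        calculate = hashlib.md5(word.encode()).hexdigest()
        if calculate in set_md5:
            print(f"{calculate}:{word}")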