hashcat Forum
How to filter duplicate content in large dictionary files and multiple dictionaries? - Printable Version

+- hashcat Forum (https://hashcat.net/forum)
+-- Forum: Support (https://hashcat.net/forum/forum-3.html)
+--- Forum: hashcat-utils, maskprocessor, statsprocessor, md5stress, wikistrip (https://hashcat.net/forum/forum-28.html)
+--- Thread: How to filter duplicate content in large dictionary files and multiple dictionaries? (/thread-12162.html)



How to filter duplicate content in large dictionary files and multiple dictionaries? - mima8cn - 09-18-2024

The large dictionary I downloaded online is about 150 GB. How can I filter out the entries that are already in my existing dictionary files? My computer has 80 GB of memory, and rli.exe reports insufficient memory when a file exceeds 20 GB. Is there any other way to remove content that already exists in other files?
The operating system is Windows.

In addition, there is a single 150 GB file. Is there any way to remove the duplicate content within a single file?


RE: How to filter duplicate content in large dictionary files and multiple dictionaries? - Snoopy - 09-18-2024

(09-18-2024, 10:26 AM)mima8cn Wrote: The large dictionary I downloaded online is about 150 GB. How can I filter out the entries that are already in my existing dictionary files? My computer has 80 GB of memory, and rli.exe reports insufficient memory when a file exceeds 20 GB. Is there any other way to remove content that already exists in other files?
The operating system is Windows.

In addition, there is a single 150 GB file. Is there any way to remove the duplicate content within a single file?

depending on the number of passwords / size of your dictionaries and the attacked hash type, i would just leave it as it is

when attacking fast hashes like NTLM or MD5, small dictionaries are processed almost instantly, so filtering your dictionary files would take more time than just hashing them again

anyway, you could use the Windows Subsystem for Linux (WSL) and tools like sort and comm for this, but you need to sort your input beforehand, so preparing all of your input files will also take some time; not quite sure whether sort can handle files that big or not

jfyi

big.txt (after sort)
Code:
1
10
2
3
4
5
6
7
8
9

small.txt (after sort)
Code:
3
5
7

Code:
comm -23 big.txt small.txt > uniq-big.txt

would result in the unique lines of big.txt minus small.txt:
Code:
1
10
2
4
6
8
9
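
a rough end-to-end sketch of the same idea for the 150 GB case under WSL (just an illustration, not a tested recipe; the temp path /mnt/d/tmp, the 16G memory cap and the output file names are my own assumptions, adjust them to your machine) - GNU sort does an external merge sort with temporary files on disk, so the big file does not have to fit into RAM:

Code:
# sort both lists and drop duplicate lines (-u); -S limits RAM use, -T puts the temp files on a drive with enough free space
sort -u -S 16G -T /mnt/d/tmp big.txt -o big.sorted.txt
sort -u -S 16G -T /mnt/d/tmp small.txt -o small.sorted.txt
# keep only lines of big.sorted.txt that do not appear in small.sorted.txt (both inputs must be sorted)
comm -23 big.sorted.txt small.sorted.txt > uniq-big.txt

the first sort alone already covers the second question (removing the duplicates inside the single 150 GB file); just expect it to take a while and make sure the temp drive has roughly the size of the input free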



RE: How to filter duplicate content in large dictionary files and multiple dictionaries? - mima8cn - 09-19-2024

(09-18-2024, 05:15 PM)Snoopy Wrote:

These commands are for Linux, and my system is Windows. I want to know how to filter out duplicate content and files on Windows.


RE: How to filter duplicate content in large dictionary files and multiple dictionaries? - Snoopy - 09-19-2024

that's why i said WSL -> Windows Subsystem for Linux, it's software from Microsoft for running a linux distribution seamlessly integrated into Windows 10 and 11.

On newer Windows 10/11 you can install it from an (administrator) powershell, just run

wsl --install

see https://learn.microsoft.com/de-de/windows/wsl/install for more information

Windows doesn't have such built-in tools, you would have to use a programming language like python or a third-party program, but i don't know a program for your problem
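
if you want to go the python route, here is a rough sketch of one workable approach (my own illustration, not a tested tool; the file names and the bucket count are made up): split both lists into buckets by a hash of each line so that every bucket fits into RAM, then subtract the existing dictionary from the big list bucket by bucket; this also drops the duplicates inside the big file itself

Code:
import os
from hashlib import blake2b

BUCKETS = 256  # pick this so that (size of big.txt / BUCKETS) fits comfortably in RAM

def bucket_of(word: bytes) -> int:
    # stable hash, so the same word always lands in the same bucket for both files
    return blake2b(word, digest_size=1).digest()[0] % BUCKETS

def split_into_buckets(path: str, prefix: str) -> None:
    outs = [open(f"{prefix}.{i:03d}", "wb") for i in range(BUCKETS)]
    with open(path, "rb") as f:
        for line in f:
            word = line.rstrip(b"\r\n")
            outs[bucket_of(word)].write(word + b"\n")
    for o in outs:
        o.close()

split_into_buckets("big.txt", "big.part")      # the 150 GB download (name assumed)
split_into_buckets("small.txt", "small.part")  # your existing dictionary (name assumed)

with open("uniq-big.txt", "wb") as out:
    for i in range(BUCKETS):
        with open(f"small.part.{i:03d}", "rb") as f:
            known = {l.rstrip(b"\n") for l in f}   # this bucket of the existing dictionary
        seen = set()
        with open(f"big.part.{i:03d}", "rb") as f:
            for line in f:
                word = line.rstrip(b"\n")
                if word not in known and word not in seen:
                    seen.add(word)                 # also removes duplicates within big.txt
                    out.write(word + b"\n")
        os.remove(f"big.part.{i:03d}")
        os.remove(f"small.part.{i:03d}")

note that the output is not in the original order and that you temporarily need enough free disk space for a second copy of both files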


RE: How to filter duplicate content in large dictionary files and multiple dictionaries? - b8vr - 09-20-2024

I recommend using rling:

https://github.com/Cynosureprime/rling

It is very fast.

From the examples section in the repo:

There are many common, and not so common uses for rling.
Code:
rling big-file.txt new-file.txt /path/to/old-file.txt /path/to/others/*

This will read in big-file.txt, remove any duplicate lines, then check /path/to/old-file.txt and all files matching /path/to/others/*. Any line found in these files that also exists in big-file.txt will be removed. Once all files are processed, new-file.txt is written with the lines that don't match. This is a great way to remove lines from a new dictionary file if you already have them in your existing dictionary lists.
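
For the single 150 GB file from the first post, the same call pattern with no extra "remove" files should simply deduplicate the input (a hedged example, the output file name is my own):

Code:
rling big-file.txt big-file-deduped.txt

If memory becomes a problem on the 80 GB machine, check the options documented in the repo's README for trading speed against lower memory use.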
This will read in big-file.txt, remove any duplicate lines, then check /path/to/old-file.txt and all files matching /path/to/others/*. Any line that is found in these files that also exists in big-file.txt will be removed. Once all files are processed, new-file.txt is written with the lines that don't match. This is a great way to remove lines from a new dictionary file, if already have them in your existing dictionary lists.