Modifying hashcat's optimized kernel
#1
So I'm trying to modify hashcat's optimized kernel and bring its speed up for certain attacks I prefer to run.


Long story short - optimized kernels on MD5 (mode 0) with attack mode 0 run with a maximum password length of 31. I think I've narrowed it down to this file here:


https://github.com/hashcat/hashcat/blob/...timized.cl

Running a large dictionary file with large rule sets, using the maximum lengths doesn't work well in conjunction with optimized kernels - so I'm just trying to bring the maximum length under optimized kernels down from 31 to 18. Possibly even have a version with a max of 14.
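Just to illustrate what I mean, here's the reject-above-N idea in plain C - this is NOT the actual kernel code from the file linked above, the names are made up, and 18 is just the cap I'm after:

Code:
#include <stdio.h>
#include <string.h>

/* Illustration only - not hashcat code. Shows the "reject anything
   longer than N before hashing it" idea with the 18-character cap. */
#define PROPOSED_PW_MAX 18

static int accept_candidate (const char *pw)
{
  return strlen (pw) <= PROPOSED_PW_MAX;
}

int main (void)
{
  const char *samples[] = { "password123", "thiscandidateismuchtoolongtokeep" };
  const int count = sizeof (samples) / sizeof (samples[0]);

  for (int i = 0; i < count; i++)
  {
    printf ("%-34s %s\n", samples[i], accept_candidate (samples[i]) ? "keep" : "reject");
  }

  return 0;
}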
#2
I don't understand your request here, you want to lower the maximum allowed password length of the kernel? Why? You likely won't gain much, if any speed by doing so.
#3
(04-20-2021, 01:56 AM)Chick3nman Wrote: I don't understand your request here, you want to lower the maximum allowed password length of the kernel? Why? You likely won't gain much, if any speed by doing so.

Yup, you are exactly correct, and I've worked it out to be an average speed increase of around 32%, which adds up very quickly when trying to get through lists that have been untouched for years.
#4
How did you work that speed increase out? Genuinely curious where it's coming from. I could play with doing this, but I'm not sure it's actually going to do much.
#5
(04-20-2021, 03:59 AM)Chick3nman Wrote: How did you work that speed increase out? Genuinely curious where it's coming from. I could play with doing this, but I'm not sure it's actually going to do much.

I ran a bunch of lists with the Fordyv1 rule as well as best64 and a few others, dumping the candidates as they were written out so I could collect stats on how often rules create duplicate attempts on average, as well as how well optimized kernels work (which of course is really, really well). However, WELL over 30% of the candidates were above 18 characters in length. That's GREAT with optimized kernels in normal cracking, but when trying to nail the optimal speed vs. success-rate ratio, I'd like to bring that optimized kernel limit down.
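For what it's worth, the length stats came from a small helper along these lines (a rough sketch, not the exact tool I used, and it only covers the length side, not the duplicate counting). It reads candidates from stdin, e.g. piped out of hashcat --stdout with the wordlist and rules (file names here are placeholders), and tallies the length distribution plus the share over 18 and under 7 characters:

Code:
#include <stdio.h>
#include <string.h>

/* Rough sketch of the length-stats helper (not the exact tool used).
   Usage idea: hashcat --stdout wordlist.txt -r some.rule | ./lenstats */
int main (void)
{
  unsigned long long hist[256] = { 0 };
  unsigned long long total = 0;
  char line[8192];

  while (fgets (line, sizeof (line), stdin))
  {
    size_t len = strcspn (line, "\r\n"); /* strip the trailing newline */

    if (len > 255) len = 255;

    hist[len]++;
    total++;
  }

  unsigned long long over18 = 0;
  unsigned long long under7 = 0;

  for (int i = 0; i < 256; i++)
  {
    if (hist[i] == 0) continue;

    printf ("len %3d : %llu\n", i, hist[i]);

    if (i > 18) over18 += hist[i];
    if (i <  7) under7 += hist[i];
  }

  if (total > 0)
  {
    printf ("over 18 chars: %.1f%%  under 7 chars: %.1f%%\n",
            100.0 * over18 / total, 100.0 * under7 / total);
  }

  return 0;
}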
#6
So you just want to reduce the length of valid passwords? Because you don't gain speed in the kernel computation itself in doing so, only overall work in terms of reduced keyspace. You could do this with keyspace ordering too, without modifications to the kernel.
#7
(04-20-2021, 07:35 AM)Chick3nman Wrote: So you just want to reduce the length of valid passwords? Because you don't gain speed in the kernel computation itself in doing so, only overall work in terms of reduced keyspace. You could do this with keyspace ordering too, without modifications to the kernel.

Yeah, so the overall idea was to reduce the overall work AFTER the kernel computation on the wordlist + rule set.
I'm not sure what you mean by keyspace ordering in this sense - could you elaborate? If it achieves the end result of equal speed but less work/a reduced keyspace, then yeah, that's exactly what I'm after.


Changing the minimum would also be pretty effective at stopping repeated attempts under, say, 7 -
around 15% of attempts with the average wordlist + a decent ruleset are repeats under 7 characters.
#8
So it sounds like, but correct me if I'm wrong here, you have a wordlist and some rules you are using for your attack. You want to limit the length of the passwords that are tested by modifying the kernel, instead of using the < or > rules, since those don't work on GPU (or are really slow). In doing so, you will cut down on the total amount of stuff you test, cutting down your keyspace and resulting in a faster attack overall. Right?

The problem with this is that you are assuming there is no penalty for kernel password rejections, or that it's significantly faster to reject a plain than it is to hash it. This assumption is incorrect: rejections on the GPU are pretty much always going to be slow, as the candidate has already made it into the buffer, and you will likely not save nearly as much time as you'd think by rejecting at that stage. With the way threads are executed in parallel groups, any thread that rejects all of its candidates for its work chunk has to wait for the rest of the threads in its group to finish before it can get more work. You should be cutting the words that are too long out before passing them to the GPU and/or putting them in the buffer in the first place.
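To be concrete about what I mean by cutting them out beforehand, the host-side filter can be as dumb as this (just a sketch - the 7/18 cutoffs and file names are placeholders for whatever your limits are), run against the wordlist or a pre-generated candidate stream before anything hits the GPU:

Code:
#include <stdio.h>
#include <string.h>

/* Minimal length filter: keep only lines whose length is within [MIN_LEN, MAX_LEN].
   Example: ./lenfilter < candidates.txt > trimmed.txt */
#define MIN_LEN 7
#define MAX_LEN 18

int main (void)
{
  char line[8192];

  while (fgets (line, sizeof (line), stdin))
  {
    size_t len = strcspn (line, "\r\n");

    if (len >= MIN_LEN && len <= MAX_LEN)
    {
      fwrite (line, 1, len, stdout);
      fputc ('\n', stdout);
    }
  }

  return 0;
}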

Now, there's a problem with this method too, which I'm sure you've run into. Processing all those rules on the host before sending the candidates to the GPU is ALSO slow, otherwise you could just use -j to set a < or > rule. So you need to strike a balance. If your keyspace is manageable enough to place on disk, you could possibly save time by generating the whole thing, cutting out the words that are too long, and running that, but it's unlikely unless you are generating a LOT of words that are too long. If your keyspace is not that small, then you will need to filter your wordlist and rules to cut down on passwords that are too long for your use case prior to running the attack, or deal with the speed as is. Hashcat actually already takes advantage of a speed boost for lists of words that are short enough. You can see that here: https://github.com/hashcat/hashcat/blob/...2981-L2991

In an ideal scenario, you could generate your whole keyspace, cut out the words that are too long, and order the remaining words into chunks that fall into the buffer sizes used in hashcat for better speeds, putting the faster candidates first in the keyspace. If you order and chunk your keyspace this way, starting with the shorter passwords first, you will theoretically clear the entire keyspace faster than if they were distributed randomly in a single file. This is what I meant by "keyspace ordering", but in a more general sense, because length is not the only defining feature you can use to increase overall attack speeds. You can also order keyspaces by likelihood based on external metrics, cracking the majority of what you are going to crack earlier in the attack and reducing overall attack time. This is what we do with the Markov chains used by the mask attacks: put more likely candidates earlier into the keyspace and less likely ones at the end. Now, other bottlenecks exist, but the concept remains the same. Using rules that create more work but specifically do not make the passwords significantly longer can be a great way to increase utilization AND recover significant speed, avoiding a number of slowdowns.
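A crude way to get that shortest-first ordering is to bucket the candidate stream by length and then concatenate the buckets in ascending order (sketch only - the file naming and the 18 cutoff are arbitrary, and real chunking would also want to respect the buffer sizes mentioned above):

Code:
#include <stdio.h>
#include <string.h>

/* Sketch: split a candidate stream into one file per length so the keyspace
   can be re-assembled shortest-first (e.g. cat bucket_len_*.txt > ordered.txt). */
#define MAX_KEEP 18 /* drop anything longer outright */

int main (void)
{
  FILE *out[MAX_KEEP + 1] = { 0 };
  char line[8192];

  while (fgets (line, sizeof (line), stdin))
  {
    size_t len = strcspn (line, "\r\n");

    if (len == 0 || len > MAX_KEEP) continue;

    if (out[len] == NULL)
    {
      char name[64];

      snprintf (name, sizeof (name), "bucket_len_%02zu.txt", len);

      out[len] = fopen (name, "w");

      if (out[len] == NULL) { perror (name); return 1; }
    }

    fwrite (line, 1, len, out[len]);
    fputc ('\n', out[len]);
  }

  for (int i = 1; i <= MAX_KEEP; i++)
  {
    if (out[i]) fclose (out[i]);
  }

  return 0;
}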

If I got my understanding of the problem wrong, then there may still be room for improvement in your attack via a kernel mod, but as far as trying to use the kernel to reject candidates goes, that's already a bit too late to gain significant speed, and your efforts would be far better focused on dealing with the keyspace _prior_ to loading. As far as trying to speed up MD5 by hashing less data, hashcat is already doing that about as fast as makes sense, so there's not much room there either, at least within hashcat.
#9
(04-21-2021, 01:51 AM)Chick3nman Wrote: So it sounds like, but correct me if I'm wrong here, you have a wordlist and some rules you are using for your attack. You want to limit the length of the passwords that are tested by modifying the kernel, instead of using the < or > rules, since those don't work on GPU (or are really slow). In doing so, you will cut down on the total amount of stuff you test, cutting down your keyspace and resulting in a faster attack overall. Right?

The problem with this is that you are assuming there is no penalty for kernel password rejections, or that it's significantly faster to reject a plain than it is to hash it. This assumption is incorrect: rejections on the GPU are pretty much always going to be slow, as the candidate has already made it into the buffer, and you will likely not save nearly as much time as you'd think by rejecting at that stage. With the way threads are executed in parallel groups, any thread that rejects all of its candidates for its work chunk has to wait for the rest of the threads in its group to finish before it can get more work. You should be cutting the words that are too long out before passing them to the GPU and/or putting them in the buffer in the first place.

Now, there's a problem with this method too, which I'm sure you've run into. Processing all those rules on the host before sending the candidates to the GPU is ALSO slow, otherwise you could just use -j to set a < or > rule. So you need to strike a balance. If your keyspace is manageable enough to place on disk, you could possibly save time by generating the whole thing, cutting out the words that are too long, and running that, but it's unlikely unless you are generating a LOT of words that are too long. If your keyspace is not that small, then you will need to filter your wordlist and rules to cut down on passwords that are too long for your use case prior to running the attack, or deal with the speed as is. Hashcat actually already takes advantage of a speed boost for lists of words that are short enough. You can see that here: https://github.com/hashcat/hashcat/blob/...2981-L2991

In an ideal scenario, you could generate your whole keyspace, cut out the words that are too long, and order the remaining words into chunks that fall into the buffer sizes used in hashcat for better speeds, putting the faster candidates first in the keyspace. If you order and chunk your keyspace this way, starting with the shorter passwords first, you will theoretically clear the entire keyspace faster than if they were distributed randomly in a single file. This is what I meant by "keyspace ordering", but in a more general sense, because length is not the only defining feature you can use to increase overall attack speeds. You can also order keyspaces by likelihood based on external metrics, cracking the majority of what you are going to crack earlier in the attack and reducing overall attack time. This is what we do with the Markov chains used by the mask attacks: put more likely candidates earlier into the keyspace and less likely ones at the end. Now, other bottlenecks exist, but the concept remains the same. Using rules that create more work but specifically do not make the passwords significantly longer can be a great way to increase utilization AND recover significant speed, avoiding a number of slowdowns.

If I got my understanding of the problem wrong, then there may still be room for improvement in your attack via a kernel mod, but as far as trying to use the kernel to reject candidates goes, that's already a bit too late to gain significant speed, and your efforts would be far better focused on dealing with the keyspace _prior_ to loading. As far as trying to speed up MD5 by hashing less data, hashcat is already doing that about as fast as makes sense, so there's not much room there either, at least within hashcat.

This has been the exact situation; however, the lists are far too large even for my storage capacity.

I believe that even with the rejections, the speed increase from modifying the kernel will be drastic.
The main reason for this, which you can test yourself: just feed in a wordlist with some rules that make LONG passwords, then run it with optimized kernels and compare the speed difference with and without.
The only other possibility, which definitely can be done but takes a lot longer, is to extract passwords of every length into individual files from my current dictionaries and then separate ALL my rules into sections dedicated to the lengths of those dictionaries. This is a very time-consuming process with over 360,000 rules to go through, but it is possible.
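As a rough first pass at the rule side of that, something like this could split a rule file into "probably grows length" vs "probably doesn't" (sketch only - it just scans each line for characters used by length-growing functions like $ ^ d p f i q z Z X, so it's not a real parse of the rule syntax and the output file names are arbitrary):

Code:
#include <stdio.h>
#include <string.h>

/* Crude rule triage (sketch): rules containing any character used by a
   length-growing function go to grow.rule, the rest to keep.rule.
   It will over-flag rules where these characters appear only as
   arguments (e.g. inside an s or @ rule). */
int main (void)
{
  FILE *keep = fopen ("keep.rule", "w");
  FILE *grow = fopen ("grow.rule", "w");

  if (keep == NULL || grow == NULL) { perror ("fopen"); return 1; }

  const char *grow_chars = "$^dpfiqzZX";
  char line[4096];

  while (fgets (line, sizeof (line), stdin))
  {
    FILE *dst = (strpbrk (line, grow_chars) != NULL) ? grow : keep;

    fputs (line, dst);
  }

  fclose (keep);
  fclose (grow);

  return 0;
}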
I was just hoping that modifying the kernel wouldn't be overly difficult, so I'd have it for situations where it would be helpful.
#10
"The main reason for this which you can test yourself is just file in with some rules to make LONG passwords then run it with optimized kernels - and compare the speed difference with and without."

I'm not sure what you mean here.

With and without the rules? Because from a workload perspective that comparison will always favor the run with rules, and it will still be slower than if you didn't generate the failing words to begin with but kept the same rule-related workload. With and without the optimized kernel also doesn't make sense, because optimized kernels are doing far more than simply rejecting words for being too long; that's just a side effect of the optimization. Long passwords beyond the point of rejection, in large numbers, will incur a penalty to your cracking speed, up to the point where no candidates are being hashed in most work groups. At that point, you may see some speed return, but it will not be ideal or likely very performant overall.

With 360,000 rules you are well out of reasonable territory for a manageable keyspace, you are correct. That rule set is unreasonably large for an efficient attack setup, but to each their own; I have played with similarly large rule sets a fair amount, though never for a normal attack.