How does hashcat use the hardware?
#1
So, going through Google (and even these forums) you get a ton of conflicting information

Person 1 - "You need x2 the host RAM as you do the GPUs RAM"
Person 2 - "You only need enough host RAM to run the OS"

Person 1 - "I notice a significant drop in performance if the GPUs use less than 4 PCIe lanes"
Person 2 - "You won't even saturate a single PCIe lane"

Person 1 - "Run balanced hardware"
Person 2 - "A cheap CPU is all you need"

It sounds like a fight between miners and people with more money than research, so here's my question: what data is transferred when brute-forcing and when using a dictionary attack as you run hashcat?  I'll give a couple of examples here, but if anybody wants to do a deep dive on this subject (or knows of a website that does), I would appreciate that a ton.

Brute-force:
Do the GPUs talk to each other much, or report much information back to the processor over the PCIe lanes?

Dictionary:
How fast are the GPUs pulling information from the storage drive?  Is it enough to saturate the 4 lanes of PCIe 4.0 that a newer NVMe drive uses?

Overall:
Where can hashcat start to bottleneck hardware?
#2
Some of these hardware recommendations come from different places and achieve different results. Let me see if I can help here:

>Person 1 - "You need x2 the host RAM as you do the GPUs RAM"
>Person 2 - "You only need enough host RAM to run the OS"

The x2 host RAM recommendation comes from a historical requirement that is no longer as strict (I'll go get that page updated), but it is the more correct of the two. The current recommendation I tend to make is host RAM >= combined VRAM, with 2x being more of a suggestion to ease other problems you may run into down the road, like sorting. The reason host RAM needs to be >= combined VRAM is that there are times when the runtime will try to make allocations in host RAM that correspond to allocations made in VRAM. It may not, and most of the time will not, end up actually using 100% of its allocated host RAM when this happens, but if it can't complete this step, the runtime will error out. Can you run with less RAM? Absolutely, plenty of attacks will still work. But the first time you try to start an attack and run into "CL_OUT_OF_RESOURCES" or other memory issues, you will not be able to run those attacks until your host RAM is increased to that ">= combined VRAM" threshold or above.
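If you want to sanity-check a specific box against that guideline, something like the quick script below works. This is just a sketch assuming a Linux host with NVIDIA cards and nvidia-smi on the PATH, not anything hashcat does internally:

Code:
import subprocess

def combined_vram_mib():
    # Total VRAM per GPU in MiB, one value per line, straight from nvidia-smi.
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        text=True,
    )
    return sum(int(v) for v in out.split())

def host_ram_mib():
    # MemTotal in /proc/meminfo is reported in kB.
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemTotal:"):
                return int(line.split()[1]) // 1024
    raise RuntimeError("MemTotal not found")

vram = combined_vram_mib()
ram = host_ram_mib()
print(f"Combined VRAM: {vram} MiB, host RAM: {ram} MiB")
if ram < vram:
    print("Below the >= combined VRAM guideline; some attacks may hit CL_OUT_OF_RESOURCES.")
else:
    print("Meets the >= combined VRAM guideline.")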

>Person 1 - "I notice a significant drop in performance if the GPUs use less than 4 PCIe lanes"
>Person 2 - "You won't even saturate a single PCIe lane"

Person 1 is correct here: anything below x4 PCIe 3.0 lanes is the point at which we tend to see lowered performance. The reason is not simply "saturation of available bandwidth" but also, I believe, relates to transactions per second per channel/lane and how those transactions are synchronized/wait for each other. Hashcat goes through quite a bit of effort to send small, compressed data across the bus because this is often the slowest part of kernel execution, and the GPU will be waiting around for work while we try to load it. This is more impactful for some attack modes than others, which is why it can be a bit confusing. For example, -a 3 (mask or brute-force) does not have to stream candidates across the bus, as they are generated on the GPU, so it will see less impact from lowered host<->GPU communication speeds. -a 0 (straight or wordlist) streams candidates across the bus in compressed form in small chunks and decompresses them on the GPU during execution, so the bus has a greater impact here. Adding other flags such as -S, which moves candidate processing to the host and streams candidates across the bus to the GPU for minimal/no GPU-side processing other than hashing, will feel an even greater impact. This is all rather variable: some algorithms are slow enough that it doesn't matter if loading takes longer, since the GPU will be busy anyway and we can get it plenty of work, while some algorithms are so fast that getting them enough work is a serious challenge.
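To put some rough numbers on the fast-algorithm case, here's a quick back-of-envelope calculation. The candidate rate, average candidate length, and compression ratio below are made-up illustrative values, not measurements:

Code:
# Why streaming every candidate over the bus can't feed a fast hash.
PCIE3_X4_GBPS = 3.9          # approx. usable bandwidth of a PCIe 3.0 x4 link, GB/s
candidates_per_sec = 10e9    # assumed rate for a very fast algorithm (~10 GH/s)
avg_candidate_bytes = 8      # assumed average plaintext length
compression_ratio = 0.5      # assumed effect of sending candidates compressed

raw_gbps = candidates_per_sec * avg_candidate_bytes / 1e9
compressed_gbps = raw_gbps * compression_ratio

print(f"Uncompressed candidate stream: ~{raw_gbps:.0f} GB/s")
print(f"Compressed candidate stream:   ~{compressed_gbps:.0f} GB/s")
print(f"PCIe 3.0 x4 link:              ~{PCIE3_X4_GBPS} GB/s")
# Even compressed, a pure host->GPU candidate stream falls far short of what a
# fast hash can consume, which is why on-GPU candidate generation/amplification
# matters far more than raw lane count.

That's also why slow algorithms barely notice the bus: at a few hundred thousand candidates per second, the same stream is a few MB/s at most.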

>Person 1 - "Run balanced hardware"
>Person 2 - "A cheap CPU is all you need"

Person 1 is correct here. Hashcat is much more like other HPC workloads than it is like mining. Comparing it to machine learning is probably a better fit overall. We are streaming candidates to the GPUs, doing host-side decryption hooks mid-kernel, sometimes applying rules on the host side, compressing candidates and streaming to multiple cards, etc., all at the same time while an attack is running. Feeding a modern GPU enough work during some attacks on some algorithms takes a pretty hefty toll on your CPU. The recommendation I would normally make, and that I've seen others make, is 1 modern CPU core per large GPU, plus 1 or 2 cores for all other host-side tasks and OS overhead. So if you have a rig with 4x 3090, you should have a relatively modern 6-core or greater CPU. A larger, more capable CPU platform is also going to be required to meet the host RAM requirements detailed earlier. For 4x 3090, you are looking at 96GB of combined VRAM that you may need to match for allocation depending on the attack. That means you should have at least 96GB of host RAM to avoid any issues, and the next logical RAM size is likely to be 128GB. Finding "a cheap CPU" that has both 6 cores and supports 128GB of RAM is clearly not going to happen with the little Atom and Pentium chips that miners are fond of. Mining has almost no host resource requirement at all; it is nearly entirely GPU-side and done with as many corners cut as possible. Hashcat is not that way, and if you want to use it effectively, you need hardware that is capable of doing what you've asked of it.
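To make that sizing arithmetic concrete, here's a toy sketch for the 4x 3090 example. The RAM kit sizes and the cores-per-GPU rule are just the rule of thumb from above, not hard requirements:

Code:
# Rule-of-thumb host sizing for a 4x RTX 3090 rig.
num_gpus = 4
vram_per_gpu_gb = 24                          # RTX 3090
common_kit_sizes_gb = [16, 32, 64, 128, 256]  # typical RAM kit sizes

combined_vram_gb = num_gpus * vram_per_gpu_gb                                # 96 GB
host_ram_gb = next(s for s in common_kit_sizes_gb if s >= combined_vram_gb)  # 128 GB
cpu_cores = num_gpus + 2                                                     # 1 per GPU + overhead

print(f"Combined VRAM:       {combined_vram_gb} GB")
print(f"Suggested host RAM:  {host_ram_gb} GB (next common kit size >= combined VRAM)")
print(f"Suggested CPU cores: {cpu_cores}+")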


This may not answer ALL of your questions, but hopefully it helps. In reality, there are always going to be people who say things like "I have 8x 3090 on x1 risers, a 2-core CPU, and 4GB of RAM and I never notice any problems!" But as someone who spent 3 years building and testing high-end hardware (8-10 GPU commercial servers specifically for use with hashcat), I can tell you that they are either not noticing or ignoring the % losses and little issues that pile up, or they have been lucky enough so far not to have tried a more advanced attack and hit the hard limits of their setup. You can definitely do a lot with minimal hardware like that, but there are plenty of things you won't be able to do as well, and you don't want to learn that the hard way when you're, for example, on a time limit to crack something for a pentest engagement and find out the attack you need to run simply won't work on your setup.
#3
That is 100% some amazing information.  A few follow-up questions, if I could.

Host RAM - I think you've answered my questions here very well.

PCIe lanes - Current technology is that the RTX 3090 Tis have the ability to run x8 PCIe 4.0 lanes due to their 1008GB/s mind-melting speed (I'm going to look back at me saying this in 10 years and laugh).  You mentioned it's best not to dip below x4 PCIe 3.0 lanes, and I'm wondering if that includes stringing together high-performance GPUs like the 3090 Ti, which has such high-speed capability in this case? I would assume that the higher speed of the GPUs themselves may require more than just the x4 3.0 lanes?

CPU - I've got a little research to do here but I'm pretty sure you've probably nailed this one down as well.

Thank you a ton for your input!
#4
>PCIe lanes - Current technology is that the RTX 3090 Tis have the ability to run x8 PCIe 4.0 lanes due to their 1008GB/s mind-melting speed (I'm going to look back at me saying this in 10 years and laugh). You mentioned it's best not to dip below x4 PCIe 3.0 lanes, and I'm wondering if that includes stringing together high-performance GPUs like the 3090 Ti, which has such high-speed capability in this case?

The issue with PCIe lanes and trying to give good recommendations here is a bit complicated, I think. There are many, many variables at play that often get boiled down to just "x4 lanes". The vast majority of modern GPUs will negotiate and operate at pretty high speed (even x16 5.0 now) regardless of other factors, but that isn't the end of the story. CPUs have a limited number of PCIe lanes; motherboards can have chipset PCIe lanes that add to the CPU count but are not quite the same; motherboards and backplanes can have PLX chips that effectively switch PCIe communications to add even more lanes, but again this may not behave quite how you would expect. There are other devices that contend for lanes as well, including NVMe storage, Thunderbolt connections, etc. We even see lane bifurcation and duplication through cheaper PLX-style chips, leading to weird cases of more than one device per lane. All of these things complicate what we mean vs. what we say when we try to discuss the hardware. For this I will try to be as explicit as possible and clear some of that up.

Hashcat's usage of PCIe lanes is mostly relatively small, fast, low-latency data loading and device status queries through the runtime, with the occasional return of device data such as cracked hashes. The things that can slow us down related specifically to the PCIe bus are latency increases or limitations in bandwidth/transaction rate. These issues can happen for a number of reasons, such as increased error rates and TX/RX resends due to interference or poor signal quality (this happens mostly with bad risers), delays from switching due to weak/poor PLX-style chips, contention with other devices, etc. It is always best to have your GPUs attached to a high-quality physical connection (no risers) backed by a high-speed link to the CPU (no low-end PLX chips). Once you have achieved those things, you almost never need to worry about the speed, because most motherboards won't let you plug in more GPUs than your CPU can handle as it is. And if they do, it's highly likely that you are achieving at least x4 3.0 or better. It's usually when people start to put GPUs on risers and add more cards than would normally physically fit that they find out only certain slots will run at the same time, or that with so many cards their lane count is cut down to x1 per card. At the point where you are running into PCIe bandwidth issues, you've likely already done a bunch of other stuff that could contribute to degraded performance/stability, so I'm not sure I would focus on it anyway.
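If you want to see what your cards actually negotiated (risers and weak switches tend to show up here), you can just ask the driver. A minimal sketch for NVIDIA cards, assuming nvidia-smi is on the PATH:

Code:
import subprocess

# Current (negotiated) vs. maximum PCIe generation and link width per GPU.
fields = "name,pcie.link.gen.current,pcie.link.width.current,pcie.link.gen.max,pcie.link.width.max"
out = subprocess.check_output(
    ["nvidia-smi", f"--query-gpu={fields}", "--format=csv,noheader"],
    text=True,
)
for line in out.strip().splitlines():
    name, gen_cur, w_cur, gen_max, w_max = [f.strip() for f in line.split(",")]
    # Note: idle cards often drop to a lower link gen until they're under load.
    print(f"{name}: running Gen{gen_cur} x{w_cur} (card max Gen{gen_max} x{w_max})")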

To summarize, I wouldn't worry about PCIe lanes until you've covered all the other stuff because by the time you do cover all the other stuff, the PCIe lanes will almost surely not be a problem. Modern GPUs will run fast enough in an "approved" configuration for it to never be an issue. It's only when you start getting creative and trying to slot extra cards in where they may not have normally fit that it starts to become a problem worth considering.

Also, to touch on another subject that gets brought up a lot about GPUs and their operation in hashcat: the GPUs are treated as separate devices and do not cooperate directly with each other. Each card is initialized and run by the host individually. Technologies like NVLink/SLI/CrossFire/etc. are not currently in use and are not likely to be added due to limited benefit and significant complexity for the workload. As long as your host system (CPU, RAM, etc.) can comfortably run more GPUs, you can continue to add them without worry, including some mixing of different cards, though using all the same card is generally suggested and will simplify a number of things should you have issues.
#5
(10-03-2022, 06:38 PM)Chick3nman Wrote: Some of these hardware recommendations come from different places and achieve different results.

Wow, this is INCREDIBLE information.
There's one behavior I don't fully understand in hashcat: with large hash lists (3 million+) I can see the cracking speed cut in half, and I've seen that on several rigs so far. Which hardware component is responsible for that, or is that just a software bottleneck that can't be overcome?
#6
Small correction, I was distracted (can't edit anymore):
the list is around 150,000 hashes, not 3M.

I found this old post that says it's a PCI bus thing:
https://hashcat.net/forum/thread-4858-po...l#pid27272
but the rig we use is pretty expensive and high quality, so I'd like confirmation if possible.