11-05-2022, 01:46 PM
(10-03-2022, 06:38 PM)Chick3nman Wrote: Some of these hardware recommendations come from different places and achieve different results. Let me see if I can help here:
>Person 1 - "You need x2 the host RAM as you do the GPUs RAM"
>Person 2 - "You only need enough host RAM to run the OS"
The 2x host RAM recommendation comes from a historical requirement that is no longer as strong (I'll go get that page updated), but it is the more correct of the two recommendations. The current recommendation I tend to make is host RAM >= combined VRAM, with 2x being more of a suggestion to ease some other problems you may run into down the road, like sorting. The reason host RAM needs to be >= combined VRAM is that there are times when the runtime will try to make allocations in host RAM that correspond to allocations made in VRAM. It may not, and most of the time will not, actually use 100% of its allocated host RAM when this happens, but if it can't complete this step, the runtime will error out. Can you run with less RAM? Absolutely, plenty of attacks will still work. But the first time you try to start an attack and run into "CL_OUT_OF_RESOURCES" or other memory issues, you will not be able to run those attacks until your host RAM is increased to that ">= combined VRAM" threshold or above.
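To make the sizing rule concrete, here's a minimal sketch of the arithmetic. The function name, the DIMM sizes, and the 24 GB-per-card figure are my own illustrations, not anything from hashcat itself:

```python
# Sketch of the "host RAM >= combined VRAM" rule, with the historical 2x
# suggestion as an optional safety factor. Figures are illustrative.
def min_host_ram_gb(vram_per_gpu_gb, factor=1):
    """Floor for host RAM in GB: sum of all GPU VRAM, times an optional factor."""
    return sum(vram_per_gpu_gb) * factor

rig = [24, 24, 24, 24]              # e.g. four 24 GB cards
floor = min_host_ram_gb(rig)        # hard floor: 96 GB
comfy = min_host_ram_gb(rig, 2)     # historical 2x suggestion: 192 GB
print(floor, comfy)
```

The point is only that the floor scales with total VRAM, not with any one card, so adding GPUs to a rig raises the host RAM requirement too.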
>Person 1 - "I notice a significant drop in performance if the GPUs use less than 4 PCIe lanes"
>Person 2 - "You won't even saturate a single PCIe lane"
Person 1 is correct here: anything below x4 PCIe 3.0 lanes is the point at which we tend to see lowered performance. The reason is not simply saturation of available bandwidth; it also relates, I believe, to transactions per second per channel/lane and how those transactions are synchronized and wait for each other. Hashcat goes through quite a bit of effort to send small, compressed data across the bus, because this is often the slowest part of kernel execution and the GPU will be waiting around for work while we try to load it. This is more impactful for some attack modes than others, which is why it can be a bit confusing. For example, -a 3 (mask or brute force) does not have to stream candidates across the bus, as they are generated on the GPU, so it will see less impact from lowered host<->GPU communication speeds. -a 0 (straight or wordlist) streams candidates across the bus in compressed form in small chunks, and they are decompressed on the GPU during execution, so the bus has a greater impact here. Adding other flags such as -S, which moves candidate processing to the host and streams candidates across the bus to the GPU for minimal or no GPU-side processing other than hashing, will feel an even greater impact. This is all rather variable: some algorithms are slow enough that it doesn't matter if loading takes longer, since the GPU will be busy anyway and we can get it plenty of work, while some algorithms are so fast that getting them enough work is a serious challenge.
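A back-of-the-envelope calculation shows why fast algorithms stress the bus and slow ones don't. The bandwidth figure, candidate rates, and bytes-per-candidate below are rough assumptions of mine, not measured hashcat numbers:

```python
# Rough check: how many PCIe 3.0 lanes of raw bandwidth would an
# uncompressed wordlist stream demand? All figures are assumptions.
PCIE3_LANE_GBPS = 0.985  # ~985 MB/s usable per PCIe 3.0 lane

def lanes_needed(candidates_per_sec, bytes_per_candidate):
    """Raw lanes of demand if candidates were streamed uncompressed."""
    demand_gbps = candidates_per_sec * bytes_per_candidate / 1e9
    return demand_gbps / PCIE3_LANE_GBPS

# A very fast unsalted hash consuming ~1e9 plain-wordlist candidates/s
# at ~9 bytes each would want ~9 lanes of raw bandwidth -- hence the
# compression and chunking effort described above:
fast = lanes_needed(1e9, 9)
# A slow KDF at ~1e5 candidates/s barely touches the bus:
slow = lanes_needed(1e5, 9)
print(fast, slow)
```

Under these assumptions the fast case wants several lanes' worth of raw candidate data while the slow case needs a tiny fraction of one, which matches the observation that bus width matters far more for fast algorithms in streaming attack modes.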
>Person 1 - "Run balanced hardware"
>Person 2 - "A cheap CPU is all you need"
Person 1 is correct here. Hashcat is much more like other HPC workloads than it is like mining; comparing it to machine learning is probably closer overall. When an attack is running, we are streaming candidates to the GPUs, doing host-side decryption hooks mid-kernel, sometimes applying rules on the host, compressing candidates and streaming to multiple cards, etc., all at the same time. Feeding a modern GPU enough work during some attacks on some algorithms takes a pretty hefty toll on your CPU. The recommendation I normally make, and have seen others make, is 1 modern CPU core per large GPU, plus 1 or 2 extra cores for all other host-side tasks and OS overhead. So if you have a rig with 4x 3090, you should have a relatively modern CPU with 6 or more cores. A larger, more capable CPU is also going to be required to meet the host RAM requirements detailed earlier. For 4x 3090, you are looking at 96 GB of combined VRAM that you may need to match for allocation depending on the attack. That means you should have at least 96 GB of host RAM to avoid any issues, and the next logical RAM size is likely to be 128 GB. Finding "a cheap CPU" that both has 6 cores and supports 128 GB of RAM is clearly not going to happen with the little Atom and Pentium chips that miners are fond of. Mining has almost no host resource requirement at all; it is nearly entirely GPU-side and done with as many corners cut as possible. Hashcat is not that way, and if you want to utilize it effectively, you need hardware that is capable of doing what you've asked of it.
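The two rules of thumb above (1 core per large GPU plus spares, host RAM >= combined VRAM rounded up to a standard kit size) can be sketched together. The DIMM-size ladder is my assumption about commonly available kits, not anything prescribed by hashcat:

```python
# Sketch of the sizing heuristics from the post: 1 CPU core per large GPU
# plus a couple of spare cores, and host RAM covering combined VRAM rounded
# up to a common kit size. The RAM ladder is an assumption.
STANDARD_RAM_GB = [32, 64, 128, 192, 256]

def suggested_build(num_gpus, vram_per_gpu_gb, spare_cores=2):
    """Return (min CPU cores, combined VRAM GB, suggested host RAM GB)."""
    cores = num_gpus + spare_cores
    vram_total = num_gpus * vram_per_gpu_gb
    ram = next(s for s in STANDARD_RAM_GB if s >= vram_total)
    return cores, vram_total, ram

# 4x 3090 (24 GB each): 6 cores, 96 GB combined VRAM, 128 GB host RAM
print(suggested_build(4, 24))
```

For the 4x 3090 example in the post this reproduces the same numbers: a 6-core CPU, a 96 GB VRAM floor, and 128 GB as the next logical host RAM size.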
This may not answer ALL of your questions, but hopefully it helps. In reality, there will always be people who say things like "I have 8x 3090 on x1 risers, a 2-core CPU, and 4 GB of RAM and I never notice any problems!" But as someone who spent 3 years building and testing high-end hardware (8-10 GPU commercial servers specifically for use with hashcat), I can tell you that they are either not noticing or are ignoring the percentage losses and little issues that pile up, or they have been lucky enough up to that point not to have tried a more advanced attack and found the hard limits of their setup. You can definitely do a lot with minimal hardware like that, but there are plenty of things you won't be able to do as well, and you don't want to learn that the hard way when you're, for example, on a time limit to crack something for a pentest engagement and find out the attack you need to run simply won't work on your setup.
Wow, this is INCREDIBLE information.
There's one behavior I don't fully understand in Hashcat: with large hash lists (3 million+) I can see the cracking speed cut in half, and I've seen that on several rigs so far. Which hardware component is responsible for that? Or is that just a software bottleneck that can't be overcome?