False temp warnings, but MHz dips?
#1
Hello everyone.
We recently purchased an 8xGPU rig for the company I work for.

TYAN 4U (B7129F83AV14E8HR-N)
2*Intel CPU XEON Silver 4316
8*INNO3D RTX3090 N30901-246X-6261960 BLOWER
6*64GB 3200MHz RAM
4800w (+1600w fallback/backup)

We installed Windows 10 Enterprise because we are going to significantly upgrade the RAM (at least 2TB) next year. All the Nvidia, CUDA and native rack drivers installed smoothly with zero issues. Performance appears to be amazing, and the quality server room means we haven't seen temps go over 70C yet. Benchmarks are identical to other 3090s seen online, and we're getting awesome cracking speeds (~2h15m for 8*?a on a single NTLM hash).

However, we're getting what appear to be false "Driver temperature threshold met on GPU #X. Expect reduced performance" warnings every few minutes (not always on the same GPU).
I hit [s] immediately after getting the warning, and while the temps are always fine (70C, and sometimes even 40C), I do see low Utilization and MHz for a few seconds (the next status is 100% fine).
I also ran HWMonitor alongside Hashcat for a few hours while getting these warnings every few minutes, and the max temperatures (Video/Memory/Hot Spot) were never above 70C. The MHz dips are logged in HWMonitor and appear immediately after the first warning message pops up in Hashcat.

I've tried multiple attacks with different loads (all on NTLM, using -O and -w4 / -w3). This image is just one example of the behavior, but it also happens in a straight ?a?a?a?a?a?a?a?a attack, and in any other long or short attack:
https://i.postimg.cc/DwCCjdf5/example.png
In this case the warning was on #6 even though that GPU was actually fine. Very odd behavior all around.

I would really appreciate some guidance here:
Can something else trigger the temp warnings?
Are the warnings causing the MHz drop?
What can I do to identify the cause of the issue?
#2
Try command --hwmon-disable
#3
(11-05-2022, 09:29 PM)marc1n Wrote: Try command --hwmon-disable

Does that also mean that if the GPUs actually reach 90C, Hashcat will no longer stop? Or is that a different thing?
#4
(11-05-2022, 10:31 PM)cybhashcat Wrote:
Does that also mean that if the GPUs actually reach 90C, Hashcat will no longer stop? Or is that a different thing?

Disable temperature and fanspeed reads and triggers
#5
(11-05-2022, 10:51 PM)marc1n Wrote:
Disable temperature and fanspeed reads and triggers

Sorry, my question may have been unclear.
If I use --hwmon-disable, will Hashcat still ABORT at high temperatures?
#6
(11-06-2022, 12:07 PM)cybhashcat Wrote:
Sorry, my question may have been unclear.
If I use --hwmon-disable, will Hashcat still ABORT at high temperatures?

YES
#7
(11-05-2022, 01:30 PM)cybhashcat Wrote: However, we're getting what appear to be false "Driver temperature threshold met on GPU #X. Expect reduced performance" warnings every few minutes (not always on the same GPU).


This may just be a bit of a misunderstanding. These are not "false" warnings; they are mostly* real. They are just not as serious as they may seem. The warning pops due to a driver-reported value, which on many modern GPUs is reported as ~65C. This value is where the GPU starts to lower clock speeds from the maximum boost clock/bin that it has achieved for its power budget. This clock speed reduction is done in small steps as temperatures increase and is generally not very noticeable until very high temperatures, so it's nothing to really worry about. With the way modern GPUs boost, you may still be running at speeds over the rated spec even after this threshold. Usually, if you are running at a relatively cool temperature, this happens more often because the GPU temperature bounces around right at the threshold value, making the warning pop repeatedly.
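
To illustrate that last point, here is a minimal Python sketch of why a card hovering near the threshold can produce more warnings than one sitting well above it. It is only a toy model: the 65C value comes from the paragraph above, the temperature samples are made up, and `count_warnings` is a hypothetical helper, not hashcat's actual warning logic.

```python
def count_warnings(temps_c, threshold_c=65):
    """Count how often a temperature series rises from below the threshold to at or above it."""
    warnings = 0
    above = False  # assume we start below the threshold
    for t in temps_c:
        now_above = t >= threshold_c
        if now_above and not above:
            warnings += 1  # a new upward crossing would trigger a fresh warning
        above = now_above
    return warnings


# Made-up samples: one card hovering around 65C, one steadily warmer.
hovering = [64, 65, 64, 66, 64, 65, 63, 65]
steady_hot = [70, 71, 72, 71, 70, 72, 71, 70]

print("hovering:", count_warnings(hovering))
print("steady:", count_warnings(steady_hot))
```

In this toy model the cooler, hovering card crosses the threshold several times, while the steadily warmer card crosses it only once.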

Now, I put a * on "mostly real" because I've seen quite a few users reporting this behavior at temperatures well below/above expected. This behavior is sometimes inconsistent and I've not tracked down _why_ it happens but it generally appears as though the driver has reported a temperature threshold value that doesn't make sense. Hashcat's warning logic just runs off that reported value so if the value we get is junk or in the wrong format or something, it can cause the warning to pop at incorrect temperatures. Unless it's causing you serious annoyance due to the number of warnings you are receiving, this is a purely visual issue and will not affect hashcat in any meaningful way, other than producing warnings at unexpected times.
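
One way to narrow down what the driver is reacting to is to look at the throttle reasons it reports at the moment of a dip. The sketch below decodes the hex bitmask that `nvidia-smi --query-gpu=clocks_throttle_reasons.active --format=csv` prints. The bit values are assumed from NVML's `nvmlClocksThrottleReason*` constants, so check them against the `nvml.h` that matches your driver. `decode_throttle_reasons` is a hypothetical helper and has no connection to hashcat's own code.

```python
# Bit values assumed from NVML's nvmlClocksThrottleReason* constants (verify against nvml.h).
THROTTLE_REASONS = {
    0x001: "gpu_idle",
    0x002: "applications_clocks_setting",
    0x004: "sw_power_cap",
    0x008: "hw_slowdown",
    0x010: "sync_boost",
    0x020: "sw_thermal_slowdown",
    0x040: "hw_thermal_slowdown",
    0x080: "hw_power_brake_slowdown",
    0x100: "display_clock_setting",
}


def decode_throttle_reasons(mask):
    """Return the names of all throttle reasons set in the bitmask.

    Accepts an int or a hex string such as "0x0000000000000004".
    """
    if isinstance(mask, str):
        mask = int(mask.strip(), 16)
    names = [name for bit, name in THROTTLE_REASONS.items() if mask & bit]
    # Report any bits this table does not know about instead of dropping them.
    unknown = mask & ~sum(THROTTLE_REASONS)
    if unknown:
        names.append(f"unknown(0x{unknown:x})")
    return names


print(decode_throttle_reasons("0x0000000000000024"))
```

If a dip shows only power-related bits while the temperatures look fine, that would point away from heat as the cause, though that is a guess to verify rather than a diagnosis.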
#8
(11-06-2022, 04:25 PM)marc1n Wrote: YES

I tried that, and using --hwmon-disable causes the "Watchdog" text at the beginning to switch from showing a temperature to saying "Temperature abort trigger disabled".
I even tried adding "--hwmon-temp-abort=80" but that did not change anything.

(11-07-2022, 08:42 PM)Chick3nman Wrote:
This may just be a bit of a misunderstanding. These are not "false" warnings, they are mostly* real. [...]

First of all, Chick3nman, thank you for taking the time to answer.
The very quick dips to ~300 MHz are, like you said, an annoyance, but I was afraid that in the long run they may leave performance on the table (a minute here, a minute there), when that really shouldn't be happening if we never go over 70C.

If I understand correctly, this is due to what the Nvidia driver reports to hashcat. Is there any way to increase this from 65 to, let's say, 75 in the driver? Or alternatively, to achieve what marc1n was talking about: disable these warnings and dips but keep the abort temp at 90?
Thanks again.
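
To put a rough number on the "a minute here, a minute there" concern, the logged clock samples can be summarized. This is a minimal Python sketch, assuming the monitoring log has been exported as evenly spaced core-clock readings in MHz. The sample values, the 50% cut-off, and the helper name `dip_fraction` are all made up for illustration.

```python
from statistics import median


def dip_fraction(clocks_mhz, cutoff=0.5):
    """Fraction of samples whose clock is below cutoff * median clock."""
    if not clocks_mhz:
        return 0.0
    floor = cutoff * median(clocks_mhz)
    dips = sum(1 for c in clocks_mhz if c < floor)
    return dips / len(clocks_mhz)


# Made-up samples: mostly ~1800 MHz with two short dips to 300 MHz.
samples = [1800, 1815, 1800, 300, 1790, 1800, 1805, 300, 1800, 1810]
print(f"{dip_fraction(samples):.0%} of samples were dips")
```

With evenly spaced samples, this fraction approximates the share of run time spent in a dip, which is an upper bound on the work lost since the card still runs at a reduced clock during a dip.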