11-05-2022, 01:30 PM
(This post was last modified: 11-05-2022, 01:32 PM by cybhashcat.)
Hello everyone.
We recently purchased an 8xGPU rig for the company I work for.
TYAN 4U (B7129F83AV14E8HR-N)
2*Intel CPU XEON Silver 4316
8*INNO3D RTX3090 N30901-246X-6261960 BLOWER
6*64GB 3200MHz RAM
4800w (+1600w fallback/backup)
We installed Windows 10 Enterprise because we plan to significantly upgrade the RAM (to at least 2TB) next year. All the NVIDIA, CUDA and rack-native drivers installed smoothly with zero issues. Performance appears to be amazing, and the quality server room means we haven't seen temps go over 70°C yet. Benchmarks are identical to other 3090 results seen online, and we're getting awesome cracking speeds (~2h15m for 8*?a on a single NTLM hash).
However, every few minutes we're getting what appears to be a false "Driver temperature threshold met on GPU #X. Expect reduced performance" warning (not always on the same GPU).
I hit [s] immediately after getting the warning, and while the temps are always fine (70°C, and sometimes even 40°C), I do see low Utilization and MHz for a few seconds (the next status is 100% fine).
I also ran HWMonitor alongside Hashcat for a few hours while getting these warnings every few minutes, and the max temperatures (Video/Memory/Hot Spot) have never been above 70°C. The MHz dips are logged in HWMonitor and appear immediately after the first warning message pops up in Hashcat.
I've tried multiple attacks with different loads (all on NTLM, using -O and -w4 or -w3). This image is just one example of the behavior, but it also happens in a straight ?a?a?a?a?a?a?a?a attack, or any other long- or short-running attack:
https://i.postimg.cc/DwCCjdf5/example.png
In this case the warning was on #6, even though that GPU was actually fine. Very odd behavior all around.
I would really appreciate some guidance here:
Can something else trigger the temp warnings?
Are the warnings causing the MHz drop?
What can I do to identify the cause of the issue?
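One thing that might help narrow this down is logging the driver's own throttle reasons next to temperature and clocks, to see whether the dips line up with a thermal flag or with something else such as a power-related one. Below is a minimal sketch, assuming nvidia-smi is on the PATH and that the installed driver accepts the clocks_throttle_reasons.* query fields (names as listed by nvidia-smi --help-query-gpu; newer drivers may list them under a different prefix). The function names parse_rows, active_reasons, poll_once and watch are made up for illustration.

```python
import csv
import subprocess
import time

# Query fields for nvidia-smi; assumed to match `nvidia-smi --help-query-gpu`.
FIELDS = [
    "index",
    "temperature.gpu",
    "clocks.sm",
    "power.draw",
    "clocks_throttle_reasons.hw_slowdown",
    "clocks_throttle_reasons.hw_thermal_slowdown",
    "clocks_throttle_reasons.sw_thermal_slowdown",
    "clocks_throttle_reasons.hw_power_brake_slowdown",
    "clocks_throttle_reasons.sw_power_cap",
]
PREFIX = "clocks_throttle_reasons."


def parse_rows(text):
    """Parse headerless nvidia-smi CSV output into one dict per GPU."""
    rows = []
    for rec in csv.reader(text.strip().splitlines(), skipinitialspace=True):
        if len(rec) == len(FIELDS):  # skip blank or malformed lines
            rows.append(dict(zip(FIELDS, (v.strip() for v in rec))))
    return rows


def active_reasons(row):
    """Return the throttle reasons the driver reports as 'Active'."""
    return [k[len(PREFIX):] for k, v in row.items()
            if k.startswith(PREFIX) and v == "Active"]


def poll_once():
    """Run nvidia-smi once and return the parsed rows."""
    cmd = ["nvidia-smi",
           "--query-gpu=" + ",".join(FIELDS),
           "--format=csv,noheader,nounits"]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_rows(out.stdout)


def watch(interval=1.0):
    """Print a line whenever any GPU reports an active throttle reason."""
    while True:
        for row in poll_once():
            reasons = active_reasons(row)
            if reasons:
                print(time.strftime("%H:%M:%S"),
                      "GPU", row["index"],
                      row["temperature.gpu"] + "C",
                      row["clocks.sm"] + "MHz",
                      row["power.draw"] + "W",
                      ",".join(reasons))
        time.sleep(interval)
```

Calling watch() in a second console while Hashcat runs would print a timestamped line only when a GPU reports an active reason, which can then be compared against the time of each Hashcat warning.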