PCI SERR with AMD but not NVIDIA
hashcat Forum (https://hashcat.net/forum) - Forum: Hardware (https://hashcat.net/forum/forum-13.html) - Thread: /thread-4483.html
PCI SERR with AMD but not NVIDIA - ecos - 06-26-2015

Hi,

Just wondering if anybody else has encountered this crap: I have about 40 Supermicro servers, all built on the X8DTG-DF motherboard. They all use AMD HD 6990 cards and run heavy oclHashcat loads, some of them almost 24/7. On two of them a strange problem has appeared: they throw a PCI SERR after a few hours of intensive work (the actual time until the error varies, but it shows up sooner or later) and the system hangs until a hard reset. The error is "Assertion".

I have replaced all components except the MB. I have switched the GPU to an R9 290X, and the problem is still there. I decided to accept the inevitable and look into replacing the boards (Supermicro accepted this as a hardware fault). Before sending them back, I decided to also try a Titan X with cudaHashcat, so I replaced the R9 290X with a Titan X. Surprise - it has been running at almost 100% load for 3 days with no error. Still running.

All test cases use the same settings: brute force (-a 3) against an uncrackable dummy NTLM hash.

Can anybody shed some light on this issue? Have you seen this before? Are there any significant differences between the hardware interfaces of these cards?

Cheers,
ecos

RE: PCI SERR with AMD but not NVIDIA - epixoip - 06-26-2015

Sounds power-related to me. How big are the power supplies in these systems, and how many GPUs per system?

Most of the boards produced by Supermicro and Tyan have a clause buried deep in the documentation that states not to use cards that draw more than 300W of power. The 6990 is an unholy power hog (well over 400W), and the 290X is still > 300W (around 325W for single-hash NTLM brute force). The Titan X, on the other hand, only draws 235-250W.

RE: PCI SERR with AMD but not NVIDIA - ecos - 06-27-2015

(06-26-2015, 10:24 PM)epixoip Wrote: Sounds power-related to me. How big are the power supplies in these systems, and how many GPUs per system?
The PSU is 1400W, with one GPU per system (well, one "card", as the 6990 can be seen as two GPUs). Anyway, I did replace the PSU, with the same results. I also monitored the temperature of the GPU, MB, RAM ... nothing unusual, just the sudden PCI error; IPMI logs that crap and the Linux kernel hangs.

RE: PCI SERR with AMD but not NVIDIA - epixoip - 06-27-2015

That's a sufficiently large PSU, but I'm still kind of leaning towards power. Here's why:

With two cards that violate the PCI-e spec, you receive errors. With a card that adheres to the PCI-e spec, you get no errors.

So I think the motherboard is intelligent enough to know that > 75W is being pulled through the PCI-e slot, and disables the slot accordingly. Disabling the slot means killing communications between the driver and the hardware, simulating an ASIC hang, leaving the driver stuck in IOWAIT and the kernel in a weird state.

RE: PCI SERR with AMD but not NVIDIA - ecos - 06-28-2015

(06-27-2015, 11:51 PM)epixoip Wrote: That's a sufficiently large PSU, but I'm still kind of leaning towards power. Here's why:

Well, it turns out that the NVIDIA was just more stable. The cudaHashcat process hung sometime during the past 12 hours. No more PCI SERR, no kernel errors, but it is also pretty stuck, see below: speed = 0, temp = low, no progress.

Session.Name...: cudaHashcat
Status.........: Running
Input.Mode.....: Mask (?1?2?2?2?2?2?2?3?3?3) [10]
Hash.Target....: 0123456789abcdeffedcba9876543210
Hash.Type......: NTLM
Time.Started...: Thu Jun 25 17:46:00 2015 (2 days, 13 hours)
Time.Estimated.: Sun Jun 28 22:34:50 2015 (13 hours, 12 mins)
Speed.GPU.#1...: 0 H/s
Recovered......: 0/1 (0.00%) Digests, 0/1 (0.00%) Salts
Progress.......: 7662076434579456/9301612953526272 (82.37%)
Rejected.......: 0/7662076434579456 (0.00%)
Restore.Point..: 3432829943808/4167389316096 (82.37%)
HWMon.GPU.#1...: 100% Util, 36c Temp, 100% Fan

Quitting does not really work, but reboot does.

PS.
Nvidia settings, I hope they are in an acceptable range:

Attribute 'GPUPowerMizerMode' (ep77:0[gpu:0]) assigned value 1.
Attribute 'GPUFanControlState' (ep77:0[gpu:0]) assigned value 1.
Attribute 'GPUGraphicsClockOffset' (ep77:0[gpu:0]) assigned value 225.
Attribute 'GPUTargetFanSpeed' (ep77:0[fan:0]) assigned value 100.

RE: PCI SERR with AMD but not NVIDIA - epixoip - 06-28-2015

Just looks like an ASIC hang. +225 clock offset is rather aggressive (depending on the factory clocking), try reducing it to 200.

RE: PCI SERR with AMD but not NVIDIA - ecos - 06-28-2015

(06-28-2015, 01:12 PM)epixoip Wrote: Just looks like an ASIC hang. +225 clock offset is rather aggressive (depending on the factory clocking), try reducing it to 200.

Oh, it was one of the "superclocked" thingies:

Timestamp : Sun Jun 28 13:46:16 2015
Driver Version : 352.21
Attached GPUs : 1
GPU 0000:02:00.0
    Clocks
        Graphics : 1391 MHz
        SM : 1391 MHz
        Memory : 3304 MHz

Should this be stable?

RE: PCI SERR with AMD but not NVIDIA - epixoip - 06-28-2015

Ah yeah, +225 would put you at well over 1500 MHz. Should be just fine at +100, and you could probably go up to +200 without issue. But for testing purposes +100 is a good number.

RE: PCI SERR with AMD but not NVIDIA - ecos - 07-04-2015

(06-28-2015, 07:57 PM)epixoip Wrote: Ah yeah, +225 would put you at well over 1500 Mhz. Should be just fine at +100, and you could probably go up to +200 without issue. But for testing purposes +100 is a good number.

Running now for 5 days with +100 and no crash. Guess this was it ... so crashing with AMD and not with Nvidia. Weird.

RE: PCI SERR with AMD but not NVIDIA - epixoip - 07-04-2015

Not weird at all... Re-read what I wrote in this thread, it all makes perfect sense.

This is actually something we're fighting with right now, as we've learned that the 290X will kill the motherboard in the systems we use after about a year of continuous use.
When Tyan and Supermicro say not to use GPUs that draw more than 300W, they rather mean it. So we will no longer be using or recommending GPUs which violate the PCI-e spec, as all high-end AMD GPUs do.
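
For anyone wanting to sanity-check a card against the limit discussed in this thread, here is a minimal sketch in Python. The wattages are only the rough figures quoted above (the 6990 is given as "well over 400W", so 400 is used as a lower bound), and the names `QUOTED_DRAW_W`, `BOARD_LIMIT_W`, `exceeds_limit` and `cards_over_limit` are made up for illustration. It is not a measurement tool, just the comparison spelled out.

```python
# Rough board power figures quoted in this thread, in watts.
# Assumption: 400 is only a lower bound for the 6990 ("well over 400W"),
# and 250 is the upper end of the 235-250W range quoted for the Titan X.
QUOTED_DRAW_W = {
    "HD 6990": 400,
    "290X": 325,
    "Titan X": 250,
}

# Per-card limit the thread attributes to Supermicro and Tyan documentation.
BOARD_LIMIT_W = 300


def exceeds_limit(draw_w, limit_w=BOARD_LIMIT_W):
    """Return True if a card's draw is strictly above the stated limit."""
    return draw_w > limit_w


def cards_over_limit(draws=QUOTED_DRAW_W, limit_w=BOARD_LIMIT_W):
    """Return the names of cards whose quoted draw exceeds the limit."""
    return [name for name, watts in draws.items() if exceeds_limit(watts, limit_w)]


if __name__ == "__main__":
    for name, watts in QUOTED_DRAW_W.items():
        status = "over" if exceeds_limit(watts) else "within"
        print(f"{name}: ~{watts}W is {status} the {BOARD_LIMIT_W}W limit")
```

With the figures quoted here, the two AMD cards come out over the limit and the Titan X within it, which matches the pattern of errors reported above.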