PCI SERR with AMD but not NVIDIA
#1
Hi,

Just wondering if anybody else has encountered this crap:

I have about 40 servers built on the Supermicro X8DTG-DF motherboard. They all use the AMD HD 6990 and run heavy oclHashcat loads, some of them almost 24/7. On two of them a strange problem has appeared: they throw a PCI SERR after a few hours of intensive work (the actual time until the error varies, but it shows up sooner or later) and the system hangs until a hard reset. The error is "Assertion:PCI SERR, Bus 0 /Device 3 /Function 0". Only these two show the problem; the others do not.

I have replaced all components except the MB. I have also switched the GPU to an R9 290X, and the problem is still there. I decided to accept the inevitable and to look into replacing the boards (Supermicro accepted this as a hardware fault).

Before sending them back, I decided to also try a Titan X with cudaHashcat, so I replaced the R9 290X with a Titan X. Surprise: it has been running at almost 100% load for 3 days with no error. Still running.

All test cases were run with the same settings: brute force (-a 3) against an uncrackable dummy NTLM hash.

Can anybody shed some light on this issue? Have you seen this before? Are there any significant differences between the hardware interfaces of these cards?

Cheers,
ecos
#2
Sounds power-related to me. How big are the power supplies in these systems, and how many GPUs per system?

Most of the boards produced by Supermicro and Tyan have a clause buried deep in the documentation that states not to use cards that draw more than 300W of power. The 6990 is an unholy power hog (well over 400W), and the 290X is still > 300W (around 325W for single hash NTLM brute force.) The Titan X, on the other hand, only draws 235-250W.
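
To put rough numbers on that, here is a minimal sketch that compares the draws quoted above against the 300W limit. The per-card figures are the approximate ones from this post, not measurements; the 75W slot, 75W 6-pin and 150W 8-pin values are the usual PCI-e power allowances, and the function names are made up for illustration.

```python
# Power figures quoted in this thread, in watts (approximate, not measured here).
BOARD_LIMIT_W = 300  # per-card limit from the motherboard documentation

# Usual PCI-e power allowances: slot, 6-pin and 8-pin auxiliary connectors.
SLOT_W, SIX_PIN_W, EIGHT_PIN_W = 75, 75, 150

CARD_DRAW_W = {
    "HD 6990": 400,  # "well over 400W", so 400 is only a lower bound
    "290X": 325,     # single-hash NTLM brute force
    "Titan X": 250,  # upper end of the 235-250W range
}


def over_limit(draw_w, limit_w=BOARD_LIMIT_W):
    """Return by how many watts a card exceeds the limit (0 if within it)."""
    return max(0, draw_w - limit_w)


def report():
    """Map each card name to its excess over the board limit, in watts."""
    return {name: over_limit(draw) for name, draw in CARD_DRAW_W.items()}


if __name__ == "__main__":
    # Slot + 6-pin + 8-pin adds up to the same 300W figure.
    assert SLOT_W + SIX_PIN_W + EIGHT_PIN_W == BOARD_LIMIT_W
    for name, excess in report().items():
        status = f"over by at least {excess}W" if excess else "within limit"
        print(f"{name}: {status}")
```

Only the Titan X stays inside the limit with these numbers.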
#3
(06-26-2015, 10:24 PM)epixoip Wrote: Sounds power-related to me. How big are the power supplies in these systems, and how many GPUs per system?

Most of the boards produced by Supermicro and Tyan have a clause buried deep in the documentation that states not to use cards that draw more than 300W of power. The 6990 is an unholy power hog (well over 400W), and the 290X is still > 300W (around 325W for single hash NTLM brute force.) The Titan X, on the other hand, only draws 235-250W.

The PSU is 1400W, with one GPU per system (well, one "card", as the 6990 can be seen as two GPUs). Anyway, I did replace the PSU, with the same results. I also monitored the temperature of the GPU, MB, RAM ... nothing unusual, just the sudden PCI error: IPMI logs that crap and the Linux kernel hangs.
#4
That's a sufficiently large PSU, but I'm still kind of leaning towards power. Here's why:

With two cards that violate the PCI-e spec, you receive errors. With a card that adheres to the PCI-e spec, you get no errors. So I think the motherboard is intelligent enough to know that > 75W are being pulled through the PCI-e slot, and disables the slot accordingly. Disabling the slot means killing communications between the driver and the hardware, simulating an ASIC hang, leaving the driver stuck in IOWAIT and the kernel in a weird state.
#5
(06-27-2015, 11:51 PM)epixoip Wrote: That's a sufficiently large PSU, but I'm still kind of leaning towards power. Here's why:

With two cards that violate the PCI-e spec, you receive errors. With a card that adheres to the PCI-e spec, you get no errors. So I think the motherboard is intelligent enough to know that > 75W are being pulled through the PCI-e slot, and disables the slot accordingly. Disabling the slot means killing communications between the driver and the hardware, simulating an ASIC hang, leaving the driver stuck in IOWAIT and the kernel in a weird state.

Well, it turns out that the NVIDIA was just more stable. The cudaHashcat process hung sometime during the past 12 hours. No more PCI SERR, no kernel errors, but also pretty stuck, see below, speed = 0, temp = low, no progress:

Session.Name...: cudaHashcat
Status.........: Running
Input.Mode.....: Mask (?1?2?2?2?2?2?2?3?3?3) [10]
Hash.Target....: 0123456789abcdeffedcba9876543210
Hash.Type......: NTLM
Time.Started...: Thu Jun 25 17:46:00 2015 (2 days, 13 hours)
Time.Estimated.: Sun Jun 28 22:34:50 2015 (13 hours, 12 mins)
Speed.GPU.#1...: 0 H/s
Recovered......: 0/1 (0.00%) Digests, 0/1 (0.00%) Salts
Progress.......: 7662076434579456/9301612953526272 (82.37%)
Rejected.......: 0/7662076434579456 (0.00%)
Restore.Point..: 3432829943808/4167389316096 (82.37%)
HWMon.GPU.#1...: 100% Util, 36c Temp, 100% Fan

Quitting does not really work, but reboot does.
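
For anyone who wants to catch this kind of silent stall automatically, here is a rough sketch that checks a captured status screen for "Running" with every GPU at 0 H/s. It assumes the field layout shown in the output above; other versions may print different field names, and how the status text gets captured is left out. The names `is_stalled` and `SAMPLE` are made up for illustration.

```python
import re

# Patterns follow the status layout shown above (an assumption for other versions).
STATUS_RE = re.compile(r"^Status\.*:\s*(\S+)", re.MULTILINE)
SPEED_RE = re.compile(r"^Speed\.GPU\.#\d+\.*:\s*([\d.]+)\s*\S*H/s", re.MULTILINE)


def is_stalled(status_text):
    """True if the session claims to be running but every GPU reports 0 H/s."""
    status = STATUS_RE.search(status_text)
    speeds = [float(value) for value in SPEED_RE.findall(status_text)]
    return (
        status is not None
        and status.group(1) == "Running"
        and bool(speeds)
        and all(speed == 0 for speed in speeds)
    )


# Excerpt of the status screen from this post.
SAMPLE = """Session.Name...: cudaHashcat
Status.........: Running
Speed.GPU.#1...: 0 H/s
HWMon.GPU.#1...: 100% Util, 36c Temp, 100% Fan
"""

if __name__ == "__main__":
    print("stalled" if is_stalled(SAMPLE) else "ok")
```

A watchdog built on this would still need a hard reboot to recover, since quitting does not work in this state.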

PS. Nvidia settings, I hope they are in acceptable range:
Attribute 'GPUPowerMizerMode' (ep77:0[gpu:0]) assigned value 1.
Attribute 'GPUFanControlState' (ep77:0[gpu:0]) assigned value 1.
Attribute 'GPUGraphicsClockOffset' (ep77:0[gpu:0]) assigned value 225.
Attribute 'GPUTargetFanSpeed' (ep77:0[fan:0]) assigned value 100.
#6
Just looks like an ASIC hang. +225 clock offset is rather aggressive (depending on the factory clocking), try reducing it to 200.
#7
(06-28-2015, 01:12 PM)epixoip Wrote: Just looks like an ASIC hang. +225 clock offset is rather aggressive (depending on the factory clocking), try reducing it to 200.

Oh, it was one of the "superclocked" thingies :) Running now with +100:
Timestamp : Sun Jun 28 13:46:16 2015
Driver Version : 352.21

Attached GPUs : 1
GPU 0000:02:00.0
Clocks
Graphics : 1391 MHz
SM : 1391 MHz
Memory : 3304 MHz

Should this be stable?
#8
Ah yeah, +225 would put you at well over 1500 MHz. Should be just fine at +100, and you could probably go up to +200 without issue. But for testing purposes +100 is a good number.
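
The arithmetic, as a quick sketch: it assumes the 1391 MHz reading in the previous post was taken with the +100 offset already applied, and that the offset simply adds to the graphics clock. The function name is made up for illustration.

```python
def clock_with_offset(observed_mhz, current_offset, new_offset):
    """Estimate the graphics clock after swapping one offset for another."""
    return observed_mhz - current_offset + new_offset


if __name__ == "__main__":
    for offset in (100, 200, 225):
        print(f"+{offset}: {clock_with_offset(1391, 100, offset)} MHz")
```

Under those assumptions +225 lands at 1516 MHz and +200 at 1491 MHz.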
#9
(06-28-2015, 07:57 PM)epixoip Wrote: Ah yeah, +225 would put you at well over 1500 MHz. Should be just fine at +100, and you could probably go up to +200 without issue. But for testing purposes +100 is a good number.

Running now for 5 days with +100 and no crash. Guess this was it ... so crashing with AMD and not with Nvidia. Weird.
#10
Not weird at all... Re-read what I wrote in this thread, it all makes perfect sense :)

This is actually something we're fighting with right now, as we've learned that the 290X will kill the motherboard in the systems we use after about a year of continuous use. When Tyan and Supermicro say not to use GPUs that draw more than 300W, they rather mean it. So we will no longer be using or recommending GPUs which violate the PCI-e spec, as all high-end AMD GPUs do.