06-26-2015, 08:09 PM
Just wondering if anybody else has encountered this crap:
I have about 40 Supermicro servers built on the X8DTG-DF motherboard. They all use AMD HD 6990 cards and run heavy oclHashcat loads, some of them almost 24/7. On two of them a strange problem has appeared: they throw a PCI SERR after a few hours of intensive work (the actual time until the error varies, but it shows up sooner or later) and the system hangs until a hard reset. The error is "Assertion PCI SERR, Bus 0 /Device 3 /Function 0". Only these two show the problem; the others do not.
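In case it helps anyone look up what sits at that address, here is a rough Python sketch that turns the error line into the bus:device.function form that `lspci -s` accepts on Linux. The function name `serr_to_bdf` is made up for illustration, and I am assuming the numbers in the log are decimal (for 0/3/0 it makes no difference).

```python
import re

def serr_to_bdf(message: str) -> str:
    """Convert 'Bus B /Device D /Function F' into an lspci-style 'bb:dd.f' address.

    Assumption: the numbers in the log line are decimal; lspci expects hex.
    """
    match = re.search(
        r"Bus\s*(\d+)\s*/\s*Device\s*(\d+)\s*/\s*Function\s*(\d+)",
        message,
        re.IGNORECASE,
    )
    if match is None:
        raise ValueError("no Bus/Device/Function triple found in message")
    bus, device, function = (int(group) for group in match.groups())
    # lspci prints bus and device as two hex digits, function as one.
    return f"{bus:02x}:{device:02x}.{function:x}"

if __name__ == "__main__":
    print(serr_to_bdf("PCI SERR, Bus 0 /Device 3 /Function 0"))
```

The resulting address can then be passed to something like `lspci -s 00:03.0 -vv` to see which device or bridge is reporting the error.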
I have replaced all components except the motherboard. I also switched the GPU to an R9 290X, and the problem is still there. I decided to accept the inevitable and look into replacing the boards (Supermicro accepted this as a hardware fault).
Before sending them back, I decided to also try a Titan X with cudaHashcat, so I replaced the R9 290X with a Titan X. Surprise: it has been running at almost 100% load for 3 days with no error. Still running.
All test cases use the same settings: brute force (-a 3) against an uncrackable dummy NTLM hash.
Can anybody shed some light on this issue? Have you seen this before? Are there any significant differences between the hardware interfaces of these cards?