Three hd7970 = OS hang
#1
Hey Guys,

Recently three of my older cards bricked (1 x hd5970, 1 x hd6990 and 1 x hd7970), so I bought three hd7970s.

The new cards are all GIGABYTE GV-7970C-3GD, which are overclocked to 1000 MHz by the vendor.

The problem is that when I run them in parallel, the OS hangs after 1-2 minutes.

But I think the GPUs themselves are OK, because if I run them solo with -d 1, -d 2 and -d 3 the OS does not hang. The OS only hangs if I run at least two in parallel. It doesn't matter whether they run in a single oclHashcat instance or in multiple instances.
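
The solo-versus-parallel test described above can be written down as a dry run that prints one command line per device selection. This is only a sketch: the binary name `./oclHashcat64.bin`, the file `hash.txt` and the `-m 0 -a 3` mask attack are placeholder assumptions, only `-d` is taken from the post.

```shell
# Dry run: print one command line per device selection (singles, pairs, all three).
# HASHCAT and ARGS are placeholders (assumptions), not the original command line.
HASHCAT=${HASHCAT:-./oclHashcat64.bin}
ARGS=${ARGS:-"-m 0 -a 3 hash.txt ?a?a?a?a?a?a?a?a"}

plan_runs() {
  for sel in 1 2 3 1,2 1,3 2,3 1,2,3; do
    printf '%s -d %s %s\n' "$HASHCAT" "$sel" "$ARGS"
  done
}

# Usage: plan_runs            # review the list, then run each line by hand
```

Running the pairs one by one would show whether any specific pair of cards or slots behaves differently from the others.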

To sort out the problem I tried a lot of different scenarios, but now I am out of ideas :(

First I tried them on two different systems that I have been using for some time with other GPUs:

1st:

- Intel I7 4770k
- ASUS Z87 Expert
- Ubuntu 14.04 lts, 64 bit

2nd:

- Intel I7 4770k
- ASUS Z87 A
- Ubuntu 12.04 lts, 64 bit

On both systems the behavior is exactly the same, and since I have used these systems with other cards, my feeling is that there's no hardware defect in the boards, CPUs or RAM.

More Information about hardware:

- The original cooler has been removed and replaced with EK Watercooling blocks. They are connected in series with a watercooling bridge. The cooling flow works fine
- There are no extender cables/risers involved. All cards sit directly on the board
- All cards run headless, none of them is connected to a monitor

Heat: The GPUs run at ~40c at idle and rise to ~55c under load before the OS hangs. There's no specific temperature threshold that triggers the hang; it happens somewhere above 50c.
Power: each GPU has a dedicated 700W power supply

Things I tried to change:

- Tried Catalyst 14.4 and 14.6 beta on both systems; always ran amdconfig --initial -f --adapter=all afterwards and rebooted
- Updated the mainboard BIOS to the latest version (1803)
- Updated the GPU BIOS to the latest version (F72)
- Manually switched the PCI-E setting in the BIOS from x16 to x1
- Manually disabled ASPM in the BIOS
- Manually disabled all other power-management-related settings in the BIOS
- Underclocked the cards to stock hd7970 settings (925/1375)
- Attached original fan to fan-plug on the cards
- Switched the GPU positions from 1 to 2, 2 to 3, 3 to 1, etc..
- Disabled iommu on kernel commandline
- Blacklisted mei and mei_me modules
- Tried only with 2 cards
- Tried both, ALU intensive and memory intensive algorithm
- Bought a new mainboard (AMD AM3+ with 990FX chipset) to make sure it's not the Z87
- Switched the onboard BIOS switch to "2" to use the F70 BIOS that comes by default
- Switched back to "1" and tried to flash a reference hd7970 BIOS. This nearly bricked the card, as it caused instant kernel reboots, so I flashed it back to F72, which is the latest version
- Installed a fresh Windows 7 (64 bit) and tried on windows
- Attached the crossfire bridges
- Removed the crossfire bridges
- dmesg didn't say anything useful
- X11 log didn't say anything useful
- Replaced the PSUs with other ones

One thing to note: when I disable X11 (so that ADL can't work and oclHashcat cannot read temps etc.), the status looks like this:

Quote:
Speed.GPU.#1...: 15891.9 MH/s
Speed.GPU.#2...: 0 H/s
Speed.GPU.#3...: 0 H/s
Speed.GPU.#*...: 15891.9 MH/s

... and when I then continuously press "s", it seems #1 continues to work ...

But when I have X11 enabled and temps are read, it always looks like this:

Quote:
[s]tatus [p]ause [r]esume [b]ypass [q]uit =>

Speed.GPU.#1...: 15873.6 MH/s
Speed.GPU.#2...: 15871.1 MH/s
Speed.GPU.#3...: 15880.4 MH/s
Speed.GPU.#*...: 47625.2 MH/s
Recovered......: 0/1 (0.00%) Digests, 0/1 (0.00%) Salts
Progress.......: 1567903186944/6634204312890625 (0.02%)
Skipped........: 0/1567903186944 (0.00%)
Rejected.......: 0/1567903186944 (0.00%)
HWMon.GPU.#1...: 98% Util, 41c Temp, 29% Fan
HWMon.GPU.#2...: 98% Util, 41c Temp, 29% Fan
HWMon.GPU.#3...: 98% Util, 42c Temp, 31% Fan

[s]tatus [p]ause [r]esume [b]ypass [q]uit =>

ERROR: Temperature limit on GPU 2 reached, aborting...

The system is completely frozen at this point.
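
The status block above shows 41-42c right before the temperature abort, which suggests the reading that triggered the abort was not a real temperature. To check this across a longer captured log, a small filter along these lines could pull out the highest value that was actually reported. It is a sketch that assumes the HWMon line format shown above; the function name `max_temp` and the file name `run.log` are made up for illustration. As far as I know the abort limit is set with `--gpu-temp-abort`, but verify that against your version's help output.

```shell
# Print the highest temperature found in HWMon status lines read from stdin.
# Assumes the format shown in the status output above ("..., 41c Temp, ...").
max_temp() {
  awk '
    /HWMon\.GPU/ && match($0, /[0-9]+c Temp/) {
      t = substr($0, RSTART, RLENGTH) + 0   # "42c Temp" converts to the number 42
      if (t > max) max = t
    }
    END { print max + 0 }
  '
}

# Usage (run.log is a hypothetical capture of the status output):
#   max_temp < run.log
```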

Another interesting thing: according to the lspci output, the cards run at different PCI-E link widths (x8, x8 and x2) and ignore my manual x1 setting from the BIOS:

Code:
root@et:~# lspci -vv | grep -e "VGA " -e Width
01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Tahiti XT [Radeon HD 7970/8970 OEM / R9 280X] (prog-if 00 [VGA controller])
                LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <64ns, L1 <1us
                LnkSta: Speed 2.5GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <64ns, L1 <1us
                LnkSta: Speed 2.5GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
02:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Tahiti XT [Radeon HD 7970/8970 OEM / R9 280X] (prog-if 00 [VGA controller])
                LnkCap: Port #1, Speed 8GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <64ns, L1 <1us
                LnkSta: Speed 2.5GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                LnkCap: Port #1, Speed 8GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <64ns, L1 <1us
                LnkSta: Speed 2.5GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, Exit Latency L0s unlimited, L1 <64us
                LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
05:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Tahiti XT [Radeon HD 7970/8970 OEM / R9 280X] (prog-if 00 [VGA controller])
                LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <64ns, L1 <1us
                LnkSta: Speed 2.5GT/s, Width x2, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <64ns, L1 <1us
                LnkSta: Speed 2.5GT/s, Width x2, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-

This is from Ubuntu 14.04.
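
To make the lspci output easier to compare between boots and BIOS settings, the first LnkCap/LnkSta pair after each VGA header can be condensed into one line per card. This is a sketch, assuming the `lspci -vv` line format shown above; the function name `link_summary` is made up. Note that a lower speed in LnkSta may simply reflect the link clocking down while idle, so the width is probably the more telling value.

```shell
# Summarise advertised (LnkCap) vs. negotiated (LnkSta) PCIe link per VGA device.
# Reads `lspci -vv` output (grep-filtered or not) on stdin.
link_summary() {
  awk '
    function grab(line, re,    s) {       # return the value after the keyword
      if (!match(line, re)) return "?"
      s = substr(line, RSTART, RLENGTH)
      sub(/^[A-Za-z]+ /, "", s)            # drop the leading "Speed " or "Width "
      return s
    }
    /^[0-9a-f]/ { dev = "" }               # any new device header ends the previous device
    /VGA compatible controller/ { dev = $1; state = 0; next }
    dev == "" { next }
    /LnkCap:/ && state == 0 {
      cs = grab($0, "Speed [^,]+"); cw = grab($0, "Width x[0-9]+"); state = 1; next
    }
    /LnkSta:/ && state == 1 {
      ss = grab($0, "Speed [^,]+"); sw = grab($0, "Width x[0-9]+"); state = 2
      printf "%s width %s of %s, speed %s of %s\n", dev, sw, cw, ss, cs
    }
  '
}

# Usage (as root, so the capability fields are readable):
#   lspci -vv | link_summary
```

Only the first pair after each VGA header is used, so the extra LnkCap/LnkSta lines that belong to other functions or bridges are ignored.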

---

Updated the list above with my latest tests so that it is complete.
#2
Have you tried observing dmesg (cat /dev/kmsg)? Sometimes it prints interesting stuff right before a crash (like driver problems).
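
Since the machine freezes hard, it may help to append every kernel message to a file and flush to disk after each line, so the last lines before the hang have a better chance of surviving the reset. A minimal sketch; the function name `log_synced` and the log path are made up for illustration, and a hard hang can still lose the final lines.

```shell
# Append each line from stdin to the file given as $1 and flush to disk after every line.
log_synced() {
  out=$1
  while IFS= read -r line; do
    printf '%s\n' "$line" >> "$out"
    sync    # force the write out before the next message arrives
  done
}

# Usage (hypothetical log path, run as root before starting the GPUs):
#   cat /dev/kmsg | log_synced /root/kmsg-before-hang.log
```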
#3
To me it looks like a power problem, as if the cards draw too much power from PCIe and the intelligent power management on these motherboards disables them.
#4
(07-27-2014, 06:53 PM)undeath Wrote: Have you tried observing dmesg (cat /dev/kmsg)? Sometimes it prints interesting stuff right before a crash (like driver problems).

Nothing unusual I'd say..

The last entry after the crash is:

Quote:
[ 4.438533] r8169 0000:04:00.0 eth0: link up
[ 4.438538] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready

But I think that just means the link came up 4s after kernel boot, so it's not related...
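
The number in brackets is the time since boot in seconds, so the reasoning above can be checked by comparing the last logged timestamp with the uptime at which the hang happened. A sketch, assuming dmesg-style lines as quoted above; the function name `last_msg_age` is made up, and the uptime has to be supplied by hand or read from /proc/uptime while the system is still alive.

```shell
# Print how long before a given uptime the last kernel message was logged.
# $1: uptime in seconds; stdin: dmesg-style lines such as "[ 4.438538] ...".
last_msg_age() {
  awk -v up="$1" '
    match($0, /\[ *[0-9]+\.[0-9]+\]/) {
      ts = substr($0, RSTART + 1, RLENGTH - 2) + 0   # strip the brackets, keep the number
    }
    END { printf "last message at %.1fs, %.1fs before uptime %.1fs\n", ts, up - ts, up }
  '
}

# Usage while the system is still responsive:
#   dmesg | last_msg_age "$(cut -d" " -f1 /proc/uptime)"
```

A gap of a minute or more between the last message and the hang would support the view that the kernel logged nothing about the failure itself.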
#5
(07-27-2014, 09:38 PM)KT819GM Wrote: To me it looks like a power problem, as if the cards draw too much power from PCIe and the intelligent power management on these motherboards disables them.

Yeah, I think there is some relation. I disabled *everything* in the MB BIOS I could find that was somehow related to power management. But it didn't help; after a few minutes it crashed again :(

Nice try anyway!
#6
(07-27-2014, 06:26 PM)atom Wrote: Power: each GPU has a dedicated 700W power supply

Did you try a different power supply, other than those 3?
Did you try to use one of those 7970s with anything else, other than the other two 7970s?
#7
I could be wrong, but I think you should try running it with a monitor connected (in my case the PC works "strangely" without a monitor).
Also, I believe all PSUs should be connected to the mobo; they all must/should start working at the same time.
#8
(07-30-2014, 12:18 PM)Szulik Wrote: I could be wrong, but I think you should try running it with a monitor connected (in my case the PC works "strangely" without a monitor).
Also, I believe all PSUs should be connected to the mobo; they all must/should start working at the same time.

Good idea. I tried it, but it did not help :(
#9
(07-29-2014, 11:19 PM)proinside Wrote:
(07-27-2014, 06:26 PM)atom Wrote: Power: each GPU has a dedicated 700W power supply

Did you try a different power supply, other than those 3?
Did you try to use one of those 7970s with anything else, other than the other two 7970s?

I tried with another 1200W and another 1300W PSU; it didn't help :(
#10
Just an update, I've added some more tests:

- I bought a new mainboard, AMD AM3+ with 990FX chipset, to make sure it's not the Z87
- Switched the onboard BIOS switch to "2" to use the F70 BIOS that comes by default
- Switched back to "1" and tried to flash a reference hd7970 BIOS. This nearly bricked the card, as it caused instant kernel reboots, so I flashed it back to F72, which is the latest version
- Installed a fresh Windows 7 (64 bit) and tried on windows
- Attached the crossfire bridges
- Removed the crossfire bridges

None of the above helped