08-15-2014, 04:27 PM
I finally solved the problem. The source was simple: heat.
I want to explain the problem a bit; maybe it helps someone else in the future.
My old watercooling circuit had around 1 kW of cooling power and around 750 ml of water. It cooled the GPUs pretty well, but not well enough. In idle mode, all GPUs were around 40 °C.
The problems began when I put load on the GPUs. When I did that, the temps rose to around 55 °C and then the OS began to hang. But 55 °C, come on, that's nothing. Who would expect this could create problems?
When I put only a single card under load, it always worked. The system was stable. With only a single GPU in use, the cooling circuit was able to hold it below 50 °C.
So the idea was: could it be that something strange happens once the cards go above 55 °C?
As I was out of ideas (I really tried everything, as you can see in the first post), I took the risk and added some more watercooling components: another radiator, another pump, three more fans and another reservoir. In total the watercooling circuit now has a cooling power of 2 kW, and the amount of water increased from 750 ml to 2500 ml.
When I run the system now, even with crazy settings like -m 900 -n 800 and -u 1024 in -a 3 mode, it is able to keep all 3 GPUs at around 45 °C. It has been running for a day now and it's rock stable. All three GPUs run at ~17 BH/s, and all run at 99% GPU utilization.
So what exactly is the problem? Well, I don't know. All I know is that it's related to heat. My speculation is the following: there's a bug in the GIGABYTE GV-7970C-3GD GPU BIOS. This card forces me to use their customized GPU BIOS. I tried to flash it with different 7970 BIOSes, including the reference one, but whenever I did that the screen stayed black on boot. It's possible that this bug somehow creates a false positive alarm signal; for example, it adds up all temps but then forgets to divide by the number of GPUs. So if you have two or more cards, the GPU BIOS thinks the temp is higher than 110 °C and calls some emergency shutdown. Well, it's wild speculation.
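To make the speculation a bit more concrete, here is a tiny Python sketch of the kind of bug I mean. This is purely hypothetical: the function names and the way the check is written are my own assumptions for illustration, not anything taken from the actual GIGABYTE BIOS. Only the 110 °C limit and the "sum without dividing" idea come from the guess above.

```python
# Hypothetical illustration of the speculated bug. Nothing here is taken
# from real BIOS code; all names are made up for this example.

SHUTDOWN_LIMIT_C = 110  # limit assumed in the speculation above


def buggy_reading(temps_c):
    """Adds up all GPU temps but forgets to divide by the GPU count."""
    return sum(temps_c)


def correct_reading(temps_c):
    """What an averaging check would presumably compute instead."""
    return sum(temps_c) / len(temps_c)


def trips_alarm(reading_c, limit_c=SHUTDOWN_LIMIT_C):
    """True if the reading is above the emergency shutdown limit."""
    return reading_c > limit_c


if __name__ == "__main__":
    one_card = [56]
    two_cards = [56, 56]
    for temps in (one_card, two_cards):
        print(temps,
              "buggy alarm:", trips_alarm(buggy_reading(temps)),
              "correct alarm:", trips_alarm(correct_reading(temps)))
```

With one card at 56 °C neither version trips the alarm, but with two cards at 56 °C each the buggy sum is 112 and goes over the limit, while the real average is still only 56. Again, this only shows what such a bug could look like, not what the BIOS really does.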