Problems with Tyan Server
#1
My company was able to get a Tyan FT77AB7059 case (not the older B7015F77V4R), but I am having problems getting it to run with 4 7990's.

First, it is unable to POST with all four 7990's installed. The fans spin up, then down, and it just sits there. When I take out any one card, it boots fine. I looked at every power-related setting in the BIOS but could not find anything that looked useful.

Next, after getting the box to boot and installing everything according to the wiki with only 3 7990's installed, I am able to successfully run oclHashcat-plus with the following command:

./oclHashcat-plus64.bin -m0 example0.hash -a 3 ?a?a?a?a?a?a?a?a -d1

Hashcat sees all six GPUs and runs fine using only the first one. When I run it with -d1,2, Hashcat starts and gets to the status prompt, then the screen goes black and I have to reboot. If I run it with -d1,3,5, Hashcat works as expected.

Essentially, I am able to run any single GPU (within each dual-GPU card) without a problem, like -d2 or -d6. I am able to do any combination of one GPU per card, like -d1,3,6 or -d2,3,5. However, if I do -d3,4 or -d5,6, then Hashcat gets to the status prompt and immediately throws a temperature override error (even though the card is cool and just started) and the terminal window becomes unresponsive. When I open the case, the GPU that I used had been spun down.
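The pattern above (one GPU per card works, both GPUs of one card fail) can be checked mechanically before launching a run. This is only a minimal sketch: the helper names card_of and classify_devices are hypothetical, and it assumes oclHashcat numbers the two GPUs of card k as devices 2k-1 and 2k, which matches the combinations reported here but is not confirmed anywhere in the thread.

```shell
#!/bin/sh
# Assumption: the two GPUs of dual-GPU card k show up as devices 2k-1 and 2k.
card_of() {
    echo $(( ($1 + 1) / 2 ))
}

# Print "shared" if any two devices in a comma-separated list sit on the
# same card, otherwise print "distinct".
classify_devices() {
    seen=""
    for dev in $(echo "$1" | tr ',' ' '); do
        card=$(card_of "$dev")
        case " $seen " in
            *" $card "*) echo "shared"; return 0 ;;
        esac
        seen="$seen $card"
    done
    echo "distinct"
}
```

For example, classify_devices 1,3,5 reports a one-GPU-per-card selection, while classify_devices 3,4 flags a selection that loads both GPUs of the same card.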

I have the box hooked up to a dedicated 220 volt / 30 amp circuit with nothing else on it, so the box has plenty of power. I noticed that the motherboard does have an over current protection (OCP) that kicks in when any GPU is over 300W TDP (page 152 in the manual), but that does not seem to be a limitation of the older Tyan model. Any ideas?

Thanks for any pointers.
#2
the tyan is a bitch to get up and running, especially if you have no prior experience with it. better to just buy it from a vendor pre-configured ;)

make sure you have 64-bit mode enabled in the bios, vt-d disabled, iommu disabled, graphics priority set to offboard, onboard adapter disabled, xorg properly configured. if you followed my guide on the wiki for building out the server, then you'll need to use the backported kernel from quantal or raring, and if you use those then i'd highly recommend using the backported xorg as well.
#3
If only they offered the Tyan's pre-configured with four 7990's instead of Teslas which are 3-4 times as expensive, my company might have gone for it.

I followed your BIOS settings but to no avail, the box is exhibiting the same behavior.

Instead of using a backported kernel, I was going to just install quantal or raring so I didn't have to deal with backporting xorg. If that doesn't work, I was going to try CentOS 6.4, since the box officially supports RHEL 6.4, according to their website.

What bothers me is that the box will not consistently POST with all four 7990's, which means this is probably not an OS issue. Sometimes it does, but most of the time it just hangs at startup after I shut down or reboot. I have tried moving the cards to different slots, and any one of the four 7990's will work by itself, as will any combination of three. When I connect the fourth card, everything turns on and spins up, but that is it. If I connect the monitor to any of the 7990's with the fourth card in, it stays blank and the monitor reports that there is no video signal. If I connect to the VGA/USB port on the back, I get the status code 0xAD, which is the "Ready To Boot event," according to the manual.

Can I ask what you did to get your Tyan boxes up and running?
#4
(10-31-2013, 04:30 AM)Corned Beef Wrote: If only they offered the Tyan's pre-configured with four 7990's instead of Teslas which are 3-4 times as expensive, my company might have gone for it.

they do. with warranty and support packages even.


(10-31-2013, 04:30 AM)Corned Beef Wrote: Instead of using a backported kernel, I was going to just install quantal or raring so I didn't have to deal with backporting xorg.

you don't have to deal with backporting xorg. it's in precise's repo, just like the backported kernel: "apt-cache search lts-raring"
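As a rough illustration of the backport route on precise, the sketch below builds the install command for a given series. The helper names are hypothetical, and the metapackage names linux-generic-lts-<series> and xserver-xorg-lts-<series> are an assumption based on how the 12.04 enablement stacks are usually named; check them against the apt-cache search output before installing.

```shell
#!/bin/sh
# Assumption: the backported kernel and X stack are published as
# linux-generic-lts-<series> and xserver-xorg-lts-<series>.
hwe_packages() {
    series="$1"
    echo "linux-generic-lts-$series xserver-xorg-lts-$series"
}

# Print the install command instead of running it, so it can be reviewed.
hwe_install_command() {
    echo "sudo apt-get install $(hwe_packages "$1")"
}
```

Running hwe_install_command raring prints the command for the raring stack; nothing is installed until that command is run by hand.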


(10-31-2013, 04:30 AM)Corned Beef Wrote: Can I ask what you did to get your Tyan boxes up and running?

it doesn't sound like you are encountering any of the common issues, so i'd recommend clearing the bios and starting over from scratch.

with a clear bios, it should show you a message in the bios telling you that you do not have enough memory for all available pci devices. this is when you go in and enable above 4G decoding to enable 64bit resource handling.

if you are not even getting this far, then it sounds like your board might be damaged.
#5
I figured out the 4G decoding early on when I got the "not enough PCI resources" error. I spent a good deal of time poking around the bios and adjusting things, but it was good to know I was on the right track.

I was able to track down my original issue, which was power related, but not in the way I was thinking.

My problem turned out to be the table on page 152 of the manual. It outlines the ports on the power distribution board that should be connected to each GPU. For example, it states J3 or J4 should be connected to GPU 8. Since the 7990's require two 4-pin connectors, I was connecting both J3 and J4 to the 7990. Instead, I should have connected either J3 or J4, and then chosen a port from the other channel on the PDB. Once I started alternating the ports (J3 with J5, J7 and J9, etc), everything worked fine. I can now run all four 7990's under the standard Ubuntu 12.04 setup.

Which brings me to the cooling problem. While the Tyan does indeed push a biblical amount of air, two of the GPUs heat up to 90c in about 3-4 minutes and oclHashcat kills the run. The rest stay about 5-10c cooler, but they all climb into the 80's before the run is killed. I set the auto fan control in the bios to off so the case fans stay on full blast, but that doesn't seem to help. The 7990's idle in the mid 40c range. I used amdconfig to set the GPU fan to 100 percent as well, but that only delayed the temperature cut-off by another minute or so. The cards are spaced out a slot from each other and from the sides, so there is plenty of room for air to move around them. It looks like you are using the factory case fans from the pictures you posted as well. Is there anything else you had to do to keep the 7990's in the mid 70's?
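To see which GPUs are approaching the cut-off before the run is killed, the temperature report can be filtered against a limit. This is a sketch only: hot_adapters is a hypothetical helper, and it assumes the "aticonfig --odgt --adapter=all" output consists of "Adapter N - ..." lines followed by "Sensor 0: Temperature - T C" lines, which may differ between driver versions.

```shell
#!/bin/sh
# Read temperature report text on stdin and print "<adapter> <temp>" for
# every adapter at or above the limit (degrees C) given as the argument.
# Assumption: lines look like "Adapter 0 - ..." and
# "Sensor 0: Temperature - 45.00 C".
hot_adapters() {
    awk -v limit="$1" '
        $1 == "Adapter" { adapter = $2 }
        /Temperature/ {
            temp = $(NF - 1)
            if (temp + 0 >= limit + 0) print adapter, temp
        }
    '
}
```

With X running, something like "DISPLAY=:0 aticonfig --odgt --adapter=all | hot_adapters 85" would list only the adapters at 85 or hotter.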
#6
ah yeah, that will do it.

7990 runs at the boost clocks in linux out of the box (1000/1500), so you'll need to use overdrive to push the clocks back down to the stock 925/1375.

7990 is very sensitive to ambient temperature, so make sure you have it in a chilly room. disable fan management in the bios so that the chassis fans are at full chat. set the fan speed on the GPUs to 100% as soon as x11 loads to prevent heat from rapidly building up when starting compute jobs.

the GPU driving the display will always be hotter than the rest, so make sure you have the bare minimum necessary on the display. running things like compiz or whatever compositor the cool kids are using these days will only exacerbate the problem.
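The clock and fan advice in this post might be scripted roughly as follows. It is a sketch under assumptions: tune_commands is a hypothetical helper that only prints commands, the GPU count is passed in by hand, and it assumes each GPU is a separate screen on X display :0 (as set up by an xorg.conf generated for all adapters) so that DISPLAY=:0.N addresses GPU N.

```shell
#!/bin/sh
# Print the commands that enable Overdrive, set core/memory clocks on all
# adapters, and set each GPU fan to 100%. Nothing is executed here.
# Arguments: number of GPUs, core clock (MHz), memory clock (MHz).
# Assumption: GPU N is reachable as X screen :0.N.
tune_commands() {
    gpus="$1"; core="$2"; mem="$3"
    echo "aticonfig --od-enable --adapter=all"
    echo "aticonfig --odsc=$core,$mem --adapter=all"
    n=0
    while [ "$n" -lt "$gpus" ]; do
        echo "DISPLAY=:0.$n aticonfig --pplib-cmd \"set fanspeed 0 100\""
        n=$((n + 1))
    done
}
```

For four 7990's, "tune_commands 8 925 1375" prints the list for review; piping that output to sh once X has started would apply it.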
#7
(11-02-2013, 07:33 AM)epixoip Wrote: the GPU driving the display will always be hotter than the rest, so make sure you have the bare minimum necessary on the display. running things like compiz or whatever compositor the cool kids are using these days will only exacerbate the problem.

On a bit of a side note: to avoid having this one GPU running hotter than the others, is it at all possible to use the built-in onboard video to drive the display? I've tried under Ubuntu Server 12.04 LTS; the display works, but when "aticonfig --initial -f" is run it creates an xorg.conf that tries to take over the display...

After correcting the xorg.conf to run on the built-in video, I try an "aticonfig --odgt" and get:

"X needs to be running to perform ATI Overdrive commands"
#8
no, that will not work, because the x server on the onboard display would be using a different driver, and thus wouldn't help you one bit. you can run an x server on the onboard display if you want, but you'll ALSO have to run one on the AMD gpus as well.
#9
epixoip,

you had mentioned in an earlier reply that the GPU driving the display will always be hotter than the rest. I'm experiencing this, but I was toying with the idea of using an iKVM-type device to pull the display. do you know if this has the same effect as a standard monitor? thanks
#10
doesn't matter if a monitor is hooked up or not, that's where your primary screen is. just something you have to live with.