One of six GPUs has 99% Util and 0 H/s, Progress gets stuck at 99%
#1
Arch: 64bit
OS: Linux
GPU: 7 x AMD Sapphire 7970 HD
oclHashcat: 1.31
Catalyst: fglrx 14.30.4

The run starts fine and the status looks good. At some point one of the GPUs starts to show 99% Util and 0 H/s. Progress then gets stuck at 99% and the job never finishes, apparently because it is waiting for that last GPU to finish. I'm already skipping one GPU due to this issue, and now a second is acting up.

Command for testing, run should complete in about 25 min. Notice I'm already skipping GPU 6 due to the same issue.

oclHashcat64.bin --gpu-devices=1,2,3,4,5,7 --gpu-temp-retain=65 -m 0 notahash -a 3 -1 ?l?u?s?d ?1?1?1?1?1?1?1
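
As a sanity check on the "about 25 min" figure, here is a rough sketch of the arithmetic behind it. The function names are made up for illustration, and the 44273.8 MH/s figure is the combined speed from the first status output below. The charset sizes are the standard ones for hashcat's built-in ?l, ?u, ?d and ?s sets.

```python
# Sizes of the built-in charsets combined into the custom charset -1 ?l?u?s?d
CHARSET_SIZES = {"?l": 26, "?u": 26, "?d": 10, "?s": 33}

def keyspace(custom_charset, mask_length):
    """Number of candidates for a mask made of `mask_length` copies of ?1."""
    per_position = sum(CHARSET_SIZES[token] for token in custom_charset)
    return per_position ** mask_length

def eta_seconds(total, speed_mh_per_s):
    """Seconds needed to exhaust `total` candidates at a combined speed in MH/s."""
    return total / (speed_mh_per_s * 1_000_000)

total = keyspace(["?l", "?u", "?s", "?d"], 7)
print(total)                                     # 69833729609375
print(round(eta_seconds(total, 44273.8) / 60))   # 26 (minutes)
```

The total matches the denominator in the Progress line of the status output, so the estimate only holds while all six selected GPUs keep hashing.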

Three status outputs are shown below:
1. Beginning, all good
2. GPU 4 wedged, other GPUs still working at progress 7%
3. GPU 4 wedged, other GPUs finished with work, progress 99%, will never finish.


1. Beginning, all good

Session.Name...: oclHashcat
Status.........: Running
Input.Mode.....: Mask (?1?1?1?1?1?1?1) [7]
Hash.Target....: ffd1ca720c067fadb7b57ee4ab07db05
Hash.Type......: MD5
Time.Started...: Mon Jan 19 15:15:16 2015 (26 secs)
Time.Estimated.: Mon Jan 19 15:42:13 2015 (26 mins, 29 secs)
Speed.GPU.#1...: 7377.6 MH/s
Speed.GPU.#2...: 7374.1 MH/s
Speed.GPU.#3...: 7388.7 MH/s
Speed.GPU.#4...: 7370.7 MH/s
Speed.GPU.#5...: 7392.8 MH/s
Speed.GPU.#6...: 7369.9 MH/s
Speed.GPU.#*...: 44273.8 MH/s
Recovered......: 0/1 (0.00%) Digests, 0/1 (0.00%) Salts
Progress.......: 1150984126464/69833729609375 (1.65%)
Skipped........: 0/1150984126464 (0.00%)
Rejected.......: 0/1150984126464 (0.00%)
HWMon.GPU.#1...: 94% Util, 45c Temp, 20% Fan
HWMon.GPU.#2...: 94% Util, 48c Temp, 20% Fan
HWMon.GPU.#3...: 94% Util, 45c Temp, 20% Fan
HWMon.GPU.#4...: 94% Util, 55c Temp, 15% Fan
HWMon.GPU.#5...: 94% Util, 44c Temp, 20% Fan
HWMon.GPU.#6...: 94% Util, 45c Temp, 20% Fan


2. GPU 4 wedged at progress 7%, other GPUs still normal

Session.Name...: oclHashcat
Status.........: Running
Input.Mode.....: Mask (?1?1?1?1?1?1?1) [7]
Hash.Target....: ffd1ca720c067fadb7b57ee4ab07db05
Hash.Type......: MD5
Time.Started...: Mon Jan 19 15:15:16 2015 (2 mins, 1 sec)
Time.Estimated.: Mon Jan 19 15:42:05 2015 (24 mins, 46 secs)
Speed.GPU.#1...: 7354.0 MH/s
Speed.GPU.#2...: 7454.7 MH/s
Speed.GPU.#3...: 7327.9 MH/s
Speed.GPU.#4...: 0 H/s
Speed.GPU.#5...: 7385.6 MH/s
Speed.GPU.#6...: 7400.3 MH/s
Speed.GPU.#*...: 36922.5 MH/s
Recovered......: 0/1 (0.00%) Digests, 0/1 (0.00%) Salts
Progress.......: 5287989215232/69833729609375 (7.57%)
Skipped........: 0/5287989215232 (0.00%)
Rejected.......: 0/5287989215232 (0.00%)
HWMon.GPU.#1...: 94% Util, 66c Temp, 15% Fan
HWMon.GPU.#2...: 94% Util, 72c Temp, 40% Fan
HWMon.GPU.#3...: 94% Util, 64c Temp, 15% Fan
HWMon.GPU.#4...: 99% Util, 62c Temp, 15% Fan
HWMon.GPU.#5...: 94% Util, 67c Temp, 15% Fan
HWMon.GPU.#6...: 93% Util, 66c Temp, 15% Fan



3. GPU 4 wedged, other cards finished with work, progress 99%, will never finish.

Session.Name...: oclHashcat
Status.........: Running
Input.Mode.....: Mask (?1?1?1?1?1?1?1) [7]
Hash.Target....: ffd1ca720c067fadb7b57ee4ab07db05
Hash.Type......: MD5
Time.Started...: Mon Jan 19 15:15:16 2015 (31 mins, 44 secs)
Time.Estimated.: Mon Jan 19 15:47:09 2015 (7 secs)
Speed.GPU.#1...: 0 H/s
Speed.GPU.#2...: 0 H/s
Speed.GPU.#3...: 0 H/s
Speed.GPU.#4...: 0 H/s
Speed.GPU.#5...: 0 H/s
Speed.GPU.#6...: 0 H/s
Speed.GPU.#*...: 0 H/s
Recovered......: 0/1 (0.00%) Digests, 0/1 (0.00%) Salts
Progress.......: 69544608370335/69833729609375 (99.59%)
Skipped........: 0/69544608370335 (0.00%)
Rejected.......: 0/69544608370335 (0.00%)
HWMon.GPU.#1...: 0% Util, 51c Temp, 15% Fan
HWMon.GPU.#2...: 0% Util, 51c Temp, 15% Fan
HWMon.GPU.#3...: 0% Util, 57c Temp, 15% Fan
HWMon.GPU.#4...: 99% Util, 60c Temp, 15% Fan
HWMon.GPU.#5...: 0% Util, 51c Temp, 15% Fan
HWMon.GPU.#6...: 0% Util, 50c Temp, 15% Fan
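
The pattern in these outputs (a device at 0 H/s while still reporting high Util) can be picked out of a saved status dump with a few regular expressions. This is only a sketch: the function names and the 90% threshold are my own assumptions, and the patterns are written against the status format pasted above, not against any documented oclHashcat output format.

```python
import re

# Per-device lines such as "Speed.GPU.#4...: 0 H/s"; the "#*" total line is not matched.
SPEED_RE = re.compile(r"^Speed\.GPU\.#(\d+)\.+: ([\d.]+) [kMG]?H/s", re.M)
UTIL_RE = re.compile(r"^HWMon\.GPU\.#(\d+)\.+: (\d+)% Util", re.M)
PROGRESS_RE = re.compile(r"^Progress\.+: (\d+)/(\d+)", re.M)

def wedged_gpus(status_text, busy_threshold=90):
    """Status numbers of GPUs that report 0 H/s but still look busy."""
    speeds = {int(n): float(v) for n, v in SPEED_RE.findall(status_text)}
    utils = {int(n): int(u) for n, u in UTIL_RE.findall(status_text)}
    return sorted(n for n, speed in speeds.items()
                  if speed == 0 and utils.get(n, 0) >= busy_threshold)

def remaining_candidates(status_text):
    """Candidates still outstanding according to the Progress line."""
    done, total = map(int, PROGRESS_RE.search(status_text).groups())
    return total - done

sample = """\
Speed.GPU.#3...: 0 H/s
Speed.GPU.#4...: 0 H/s
Speed.GPU.#*...: 0 H/s
Progress.......: 69544608370335/69833729609375 (99.59%)
HWMon.GPU.#3...: 0% Util, 57c Temp, 15% Fan
HWMon.GPU.#4...: 99% Util, 60c Temp, 15% Fan
"""
print(wedged_gpus(sample))           # [4]
print(remaining_candidates(sample))  # 289121239040
```

In the third status output only GPU 4 is flagged, and the outstanding candidates are presumably the work that was handed to it before it hung.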
#2
Try oclHashcat v1.32 and see if it still occurs.
#3
(01-20-2015, 04:23 PM)mastercracker Wrote: Try oclHashcat v1.32 and see if it still occurs.

Thanks for pointing out the new version, mastercracker. I just tried oclHashcat v1.32 and got the same result.

Here's the status just before GPU 4 wedged, and then again just after:

Session.Name...: oclHashcat
Status.........: Running
Input.Mode.....: Mask (?1?1?1?1?1?1?1) [7]
Hash.Target....: ffd1ca720c067fadb7b57ee4ab07db05
Hash.Type......: MD5
Time.Started...: Tue Jan 20 10:49:31 2015 (2 mins, 5 secs)
Time.Estimated.: Tue Jan 20 11:16:04 2015 (24 mins, 26 secs)
Speed.GPU.#1...: 7352.2 MH/s
Speed.GPU.#2...: 7310.1 MH/s
Speed.GPU.#3...: 7379.6 MH/s
Speed.GPU.#4...: 7209.6 MH/s
Speed.GPU.#5...: 7301.1 MH/s
Speed.GPU.#6...: 7442.9 MH/s
Speed.GPU.#*...: 43995.5 MH/s
Recovered......: 0/1 (0.00%) Digests, 0/1 (0.00%) Salts
Progress.......: 5509631442944/69833729609375 (7.89%)
Skipped........: 0/5509631442944 (0.00%)
Rejected.......: 0/5509631442944 (0.00%)
HWMon.GPU.#1...: 94% Util, 67c Temp, 15% Fan
HWMon.GPU.#2...: 93% Util, 73c Temp, 61% Fan
HWMon.GPU.#3...: 94% Util, 65c Temp, 15% Fan
HWMon.GPU.#4...: 96% Util, 62c Temp, 41% Fan
HWMon.GPU.#5...: 94% Util, 68c Temp, 15% Fan
HWMon.GPU.#6...: 94% Util, 67c Temp, 15% Fan

[s]tatus [p]ause [r]esume [b]ypass [q]uit => s

Session.Name...: oclHashcat
Status.........: Running
Input.Mode.....: Mask (?1?1?1?1?1?1?1) [7]
Hash.Target....: ffd1ca720c067fadb7b57ee4ab07db05
Hash.Type......: MD5
Time.Started...: Tue Jan 20 10:49:31 2015 (2 mins, 8 secs)
Time.Estimated.: Tue Jan 20 11:16:10 2015 (24 mins, 29 secs)
Speed.GPU.#1...: 7349.4 MH/s
Speed.GPU.#2...: 7311.1 MH/s
Speed.GPU.#3...: 7303.3 MH/s
Speed.GPU.#4...: 0 H/s
Speed.GPU.#5...: 7373.7 MH/s
Speed.GPU.#6...: 7396.5 MH/s
Speed.GPU.#*...: 36734.0 MH/s
Recovered......: 0/1 (0.00%) Digests, 0/1 (0.00%) Salts
Progress.......: 5619555762176/69833729609375 (8.05%)
Skipped........: 0/5619555762176 (0.00%)
Rejected.......: 0/5619555762176 (0.00%)
HWMon.GPU.#1...: 94% Util, 68c Temp, 15% Fan
HWMon.GPU.#2...: 93% Util, 73c Temp, 76% Fan
HWMon.GPU.#3...: 94% Util, 65c Temp, 15% Fan
HWMon.GPU.#4...: 99% Util, 61c Temp, 15% Fan
HWMon.GPU.#5...: 94% Util, 68c Temp, 15% Fan
HWMon.GPU.#6...: 94% Util, 67c Temp, 15% Fan
#4
I only see 6 GPUs. What happened to the 7th? Burned?
#5
(01-20-2015, 10:14 PM)ati6990 Wrote: I only see 6 GPUs. What happened to the 7th? Burned?

I first had the problem with GPU 6, and am now having the same problem with GPU 4. I'm still not sure whether it's a hardware problem or a software bug. The cards seem to work initially, then get wedged a few minutes into a run.

I use the --gpu-devices switch to list the cards that aren't getting wedged, which is now 1,2,3,5,7.

At the start of a run you can see this as below. It's a little confusing in the status output which lists them sequentially without indicating the skips.

oclHashcat v1.32 starting...

Device #1: Tahiti, 2876MB, 925Mhz, 32MCU
Device #2: Tahiti, 3022MB, 925Mhz, 32MCU
Device #3: Tahiti, 3022MB, 925Mhz, 32MCU
Device #4: skipped by user
Device #5: Tahiti, 3022MB, 925Mhz, 32MCU
Device #6: skipped by user
Device #7: Tahiti, 3022MB, 925Mhz, 32MCU
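
To make the numbering less confusing, here is a small sketch of how the status numbers would map back to device IDs. It assumes the status output simply numbers the selected devices 1, 2, 3, ... in the order given to --gpu-devices, which is what I observe but have not confirmed in the source. The function name is made up.

```python
def status_to_device(gpu_devices):
    """Map sequential status numbers to the device IDs passed to --gpu-devices."""
    ids = [int(part) for part in gpu_devices.split(",")]
    return {status_no: dev for status_no, dev in enumerate(ids, start=1)}

mapping = status_to_device("1,2,3,5,7")
print(mapping)  # {1: 1, 2: 2, 3: 3, 4: 5, 5: 7}
```

So with devices 4 and 6 skipped, "GPU #4" in the status output would really be device 5, and "GPU #5" would be device 7.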
#6
@m0t0

It seems your GPUs are dying, at least from what you've described so far. Have you tried isolating GPUs 4 and 6 and running oclHashcat with only those two?
Or can you test those two cards in a separate machine? This should quickly clarify whether it's the hardware that's dying or oclHashcat.

Kind Regards
#7
(01-21-2015, 10:31 AM)TheDarkOne Wrote: @m0t0

It seems your GPUs are dying, at least from what you've described so far. Have you tried isolating GPUs 4 and 6 and running oclHashcat with only those two?
Or can you test those two cards in a separate machine? This should quickly clarify whether it's the hardware that's dying or oclHashcat.

Kind Regards

Thanks TheDarkOne, good idea. I just ran the test isolating GPU 4 and 6 and they completed without issue.

One more thing I've noticed about this issue is that once it occurs job control is lost. I cannot [q]uit and even trying to sudo kill -9 the pid just results in a defunct process.
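
A process that ignores kill -9 and ends up defunct is usually stuck inside the kernel. My guess, not confirmed, is that a thread is blocked in the GPU driver. On Linux the state can be read from /proc/<pid>/status, where "D" means uninterruptible sleep and "Z" means zombie (shown as defunct by ps). A minimal sketch with made-up helper names:

```python
def process_state(status_text):
    """Return the one-letter state code from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("State:"):
            return line.split()[1]
    return None

def read_state(pid):
    """Read and parse the state of a live process (Linux only)."""
    with open(f"/proc/{pid}/status") as handle:
        return process_state(handle.read())

example = "Name:\toclHashcat64.bin\nState:\tD (disk sleep)\nPid:\t1234\n"
print(process_state(example))  # D
```

The example text is illustrative and was not captured from the affected machine. Checking the per-thread entries under /proc/<pid>/task/ in the same way would show whether one thread is in "D" while the rest have exited.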
#8
Hmm, have you checked all the special settings on your motherboard? Do you have risers for your cards? The first time I tested them was the last time I ever wanted to test any...
#9
It could also be GPU or RAM clocks that are too high, or a RAM clock that is too low. PSU problems can behave like this too. I would try lowering the clocks a bit. Also, your cards should get slightly higher MH/s if they are at stock clocks (1000 MHz?), so maybe they have cooling problems and start throttling their clocks, which is sometimes the reason for an ASIC hang.
#10
(01-22-2015, 11:36 AM)ati6990 Wrote: Hmm, have you checked all the special settings on your motherboard? Do you have risers for your cards? The first time I tested them was the last time I ever wanted to test any...

Thanks @ati6990, no risers. Will review motherboard settings.