Tesla K20m sha512crypt dictionary attack performance issues
#1
Hello,
I'm experiencing a puzzling (at least for me) behaviour while performing a dictionary-only attack on sha512crypt hashes.

Scenario:
Hardware is a server with dual Xeon E5-2603v2 CPUs, 32GB RAM, and 4x Nvidia Tesla K20m with 5GB of memory each.
Software is Linux CentOS 6.5 64-bit, Nvidia 319.37 driver from the official Nvidia repos, oclHashcat 1.01.

SELinux is disabled and the system isn't doing anything else.

My command line is:

Code:
cudaHashcat64.bin --gpu-accel=1 --gpu-loops=1024 -m 1800 /root/tests/sha512.hashes /root/tests/60milliondict.txt

The sha512.hashes file contains ten sha512crypt hashes (with different salts) and the dictionary contains 60 million 8-character passwords.

The gpu-accel and gpu-loops parameters were chosen through exhaustive benchmarking (i.e. trying a set of possible gpu-accel and gpu-loops combos).
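
To give an idea of the approach, the sweep was basically a loop like the one below; the value ranges and the small sample dictionary are just illustrative, not my actual test set:

Code:
#!/bin/bash
# Rough sketch of the gpu-accel / gpu-loops sweep (example values only).
for accel in 1 2 4 8 16; do
    for loops in 256 512 1024 2048; do
        echo "=== accel=$accel loops=$loops ==="
        # Run against a small sample dictionary and note the Speed.GPU.#* line
        # reported in the status output for each combination.
        cudaHashcat64.bin --gpu-accel=$accel --gpu-loops=$loops -m 1800 \
            /root/tests/sha512.hashes /root/tests/sample10k.txt
    done
done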

What happens is that during the dictionary attack I get 400% CPU usage. It's a dual-CPU, 4-cores-per-CPU, non-hyperthreaded machine, so 800% would mean the server is completely CPU saturated; still, I don't understand what's going on, because I thought dictionary-only attacks weren't really CPU bound, since there is no wordlist to generate on the fly.

There's something else bothering me as well: if I check the status while cracking, the output is something like this:

Quote:Speed.GPU.#1...: 628 H/s
Speed.GPU.#2...: 635 H/s
Speed.GPU.#3...: 635 H/s
Speed.GPU.#4...: 635 H/s
Speed.GPU.#*...: 2533 H/s
Recovered......: 0/10 (0.00%) Digests, 0/10 (0.00%) Salts
Progress.......: 692224/600000000 (0.12%)
Rejected.......: 0/692224 (0.00%)
HWMon.GPU.#1...: 99% Util, 36c Temp, -1% Fan
HWMon.GPU.#2...: 99% Util, 37c Temp, -1% Fan
HWMon.GPU.#3...: 99% Util, 39c Temp, -1% Fan
HWMon.GPU.#4...: 99% Util, 40c Temp, -1% Fan

While the GPU utilization is high, the machine's effective speed is about 2500 H/s * 10 hashes in the file = 25 kH/s (consistent with the benchmark below), which is a bit low IMHO, since I saw posts like this one:

http://hashcat.net/forum/archive/index.p...-2340.html

where a dual Tesla K20m system seems to perform at the same level as mine. The sha512crypt benchmark is not available on that page, but the sha512 benchmark is the same despite my system having 2x the GPUs of the posted one.

These are my sha512 and sha512crypt benchmark results:

Quote:cudaHashcat64.bin -b --benchmark-mode 1 -m 1700
cudaHashcat v1.01 starting in benchmark-mode...

Device #1: Tesla K20m, 4799MB, 705Mhz, 13MCU
Device #2: Tesla K20m, 4799MB, 705Mhz, 13MCU
Device #3: Tesla K20m, 4799MB, 705Mhz, 13MCU
Device #4: Tesla K20m, 4799MB, 705Mhz, 13MCU

Hashtype: SHA512
Workload: 128 loops, 256 accel

Speed.GPU.#1.: 50713.8 kH/s
Speed.GPU.#2.: 50854.4 kH/s
Speed.GPU.#3.: 51067.6 kH/s
Speed.GPU.#4.: 50880.3 kH/s
Speed.GPU.#*.: 203.5 MH/s

Quote:cudaHashcat v1.01 starting in benchmark-mode...

Device #1: Tesla K20m, 4799MB, 705Mhz, 13MCU
Device #2: Tesla K20m, 4799MB, 705Mhz, 13MCU
Device #3: Tesla K20m, 4799MB, 705Mhz, 13MCU
Device #4: Tesla K20m, 4799MB, 705Mhz, 13MCU

Hashtype: sha512crypt, SHA512(Unix)
Workload: 5000 loops, 8 accel

Speed.GPU.#1.: 6321 H/s
Speed.GPU.#2.: 6383 H/s
Speed.GPU.#3.: 6406 H/s
Speed.GPU.#4.: 6401 H/s
Speed.GPU.#*.: 25511 H/s

But I've noticed other strange behaviours:

- Using --cpu-affinity to limit CPU usage lowers the system load (e.g. top shows only 100%), but the cracking performance stays the same at about 25 kH/s.
- Letting one single GPU device (selected with -d) fully employ the CPUs doesn't improve speed either: the single GPU still cracks about 6.3 kH/s. Example invocations are shown below.
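
For reference, the invocations were along these lines (the exact core and device numbers here are only examples):

Code:
# Pin the process to a couple of CPU cores; load drops but speed stays ~25 kH/s:
cudaHashcat64.bin --cpu-affinity=1,2 --gpu-accel=1 --gpu-loops=1024 -m 1800 \
    /root/tests/sha512.hashes /root/tests/60milliondict.txt

# Run on a single GPU so it has all the CPU power to itself; it still only does ~6.3 kH/s:
cudaHashcat64.bin -d 1 --gpu-accel=1 --gpu-loops=1024 -m 1800 \
    /root/tests/sha512.hashes /root/tests/60milliondict.txt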


So, my questions are:

- Is it normal for the CPU usage to be that high?
- Might the system be CPU bound?
- Is there any way to improve my performance?

Thanks to anyone that can help me.
#2
Maybe you're confused because the benchmark uses a single salt while you are trying to crack 10 salts at once. So your total speed is actually 25 kH/s, which matches your benchmark results.
#3
this is a salted algorithm and you have 10 salts, so your effective speed will be 10x slower than if you were cracking a single hash. 2533 H/s * 10 = 25330 H/s which is damn close to your benchmark value of 25511 H/s. so your performance is pretty close to the maximum performance under ideal conditions. you could squeeze a bit more performance out of it if you set gpu-loops to 5000, since sha512crypt is iterated 5000 times.
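
e.g. keeping the rest of your command line the same, something like:

Code:
cudaHashcat64.bin --gpu-accel=1 --gpu-loops=5000 -m 1800 /root/tests/sha512.hashes /root/tests/60milliondict.txt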

i'm not entirely sure why it's using that much cpu, but as you said the usage is only 400% and not 800%, so you are not cpu bound.

regarding the benchmark of the other system you posted... that was from several months ago, so that was likely using a different driver. i know several users have complained about large speed drops with newer nvidia drivers, so that could be part of the problem. other thing to consider is that those benchmarks were made with oclHashcat-lite, which does not exist anymore. the functionality of that product was rolled into oclHashcat, and while there was not much performance impact for most algorithms on AMD, i have no idea how nvidia was affected. so that could be part of the issue as well, but i do not know for sure.

can't really complain about the speed with your hardware, though. if you wanted it to be fast, you shouldn't have bought nvidia cards :)
#4
Hello, thank you; I think I wasn't clear enough. I know that my result for cracking 10 hashes is consistent with the benchmark, so that's not the issue.

The REAL issue is why a system with half the video cards performed as fast as mine.

And about the speed: I think you're unfair to me :-) I wanted a) a very stable system that could be left unattended, and b) a system which cracked sha512crypt hashes effectively. AFAIK Nvidia is as fast as AMD regarding sha512 hashes.
#5
(02-24-2014, 01:51 PM)afra Wrote: The REAL issue is why a system with half the video cards performed as fast as mine.

that was with a different program, and likely a different driver, so you're not really comparing apples to apples.

(02-24-2014, 01:51 PM)afra Wrote: And about the speed: I think you're unfair to me :-) I wanted a) a very stable system that could be left unattended, and b) a system which cracked sha512crypt hashes effectively. AFAIK Nvidia is as fast as AMD regarding sha512 hashes.

"a" was a valid point several years ago, but nowadays amd gpus are just as stable and reliable on linux as nvidia gpus. as for "b", no, nvidia is not as fast as amd on sha512 hashes. even with the 55% loss in 64-bit performance compared to previous generations, the R9 290X is twice as fast on sha512 as your Tesla at 1/5th the price.
#6
Please note: I'm an oclHashcat newbie, I don't want to complain, I'm just asking and trying to make my points clear.

(02-24-2014, 02:04 PM)epixoip Wrote:
(02-24-2014, 01:51 PM)afra Wrote: The REAL issue is why a system with half the video cards performed as fast as mine.

that was with a different program, and likely a different driver, so you're not really comparing apples to apples.

Sure. I wasn't expecting a 100% matching result, and I wasn't expecting 2x speed with double the cards, since I understand not everything can be perfectly parallel.

I wasn't expecting such a performance drop, either. It's true that oclHashcat-lite was a different program, but since it was merged into oclHashcat I was hoping for the performance to be more or less the same.

I was able to retrieve a copy of oclHashcat-lite 0.15 and it scores about 370 MH/s in benchmark mode, so I guess it's actually an oclHashcat vs oclHashcat-lite difference.


(02-24-2014, 02:04 PM)epixoip Wrote:
(02-24-2014, 01:51 PM)afra Wrote: And about the speed: I think you're unfair to me :-) I wanted a) a very stable system that could be left unattended, and b) a system which cracked sha512crypt hashes effectively. AFAIK Nvidia is as fast as AMD regarding sha512 hashes.

"a" was a valid point several years ago, but nowadays amd gpus are just as stable and reliable on linux as nvidia gpus. as for "b", no, nvidia is not as fast as amd on sha512 hashes. even with the 55% loss in 64-bit performance compared to previous generations, the R9 290X is twice as fast on sha512 as your Tesla at 1/5th the price.

In our scenario, reliability is king. I won't go into details, but we needed something that we could put in a remote data center and just call support for if something bad happened. We looked for somebody who could provide us with a fully supported ATI-based cracking box, but we were not able to find one.

The best-known company specializing in GPU hardware that could provide us with a Radeon-based solution, for example (I'm not naming it since I don't know whether I'm allowed to), offered just a 1-year warranty on the graphics cards, and their hardware support required us to send the whole server back to them for repairs.

By contrast, we were able to find a Tesla box with a 5-year warranty and 24x7 support that guarantees on-site intervention and a hardware fix within 8 hours. Granted, the 4x Radeon 7970 box would have been faster, but the total system price was in the same range (within about 1,000 EUR) as the Tesla K20m box. Sounds strange? If I remember right, the Radeon vendor told us the high price was because they hand-picked the most reliable 7970 cards for the task and used a custom cooling system.

I knew that the Teslas were slower for the price than the Radeon solutions; I just didn't think they were THAT slow, based on the benchmarks that were available to me at the time.

Anyway, everything seems normal and there isn't much room for improvement unless new versions of oclHashcat get better. I'll try running a matching benchmark on oclHashcat and oclHashcat-lite and maybe open a ticket with the authors about the notable difference (maybe it's just a small glitch that wasn't noticed, since CUDA is a bit of a second-class citizen in oclHashcat).
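
For the oclHashcat side of that comparison I'm thinking of something like the command below; I'm not certain that --benchmark-mode 0 actually picks up explicit --gpu-accel/--gpu-loops values the way I expect, so treat this as a first guess rather than a known-good recipe:

Code:
# Raw SHA512 benchmark, forcing the same workload the regular benchmark reported
# (256 accel, 128 loops), as a baseline to compare against oclHashcat-lite's ~370 MH/s.
cudaHashcat64.bin -b --benchmark-mode 0 --gpu-accel=256 --gpu-loops=128 -m 1700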

Thank you for your explanations and support!