Massive performance drop from 900 GH/s to 37 MH/s
#1
Hello everyone, hope you are doing well :)

I am seeing massively reduced performance on my crackstation with hashcat.

I am running a crackstation with 8x NVIDIA Tesla A100 GPUs.

Here is what I get when running a benchmark for NTLM hashes:
Code:
sudo time hashcat -a0 -m 1000 hashes/a.txt wordlists/finalweak.txt -O --force -w 4

Code:
Hashmode: 1000 - NTLM

Speed.#1.........:  116.1 GH/s (7.67ms) @ Accel:32 Loops:1024 Thr:256 Vec:1
Speed.#2.........:  116.0 GH/s (7.67ms) @ Accel:32 Loops:1024 Thr:256 Vec:1
Speed.#3.........:  116.0 GH/s (7.67ms) @ Accel:32 Loops:1024 Thr:256 Vec:1
Speed.#4.........:  116.0 GH/s (7.67ms) @ Accel:32 Loops:1024 Thr:256 Vec:1
Speed.#5.........:  116.0 GH/s (7.67ms) @ Accel:32 Loops:1024 Thr:256 Vec:1
Speed.#6.........:  116.0 GH/s (7.67ms) @ Accel:32 Loops:1024 Thr:256 Vec:1
Speed.#7.........:  116.0 GH/s (7.67ms) @ Accel:32 Loops:1024 Thr:256 Vec:1
Speed.#8.........:  116.0 GH/s (7.67ms) @ Accel:32 Loops:1024 Thr:256 Vec:1
Speed.#*.........:  928.2 GH/s

However, when trying to crack a single NTLM hash, I get nowhere near this speed: only about 37230.6 kH/s (roughly 37 MH/s). Since the benchmark reported 928 GH/s, that is a surprisingly large gap.

Here is the output I got when running this command:

Code:
hashcat (v4.0.1) starting...

nvmlDeviceGetFanSpeed(): Not Supported

nvmlDeviceGetFanSpeed(): Not Supported

nvmlDeviceGetFanSpeed(): Not Supported

nvmlDeviceGetFanSpeed(): Not Supported

nvmlDeviceGetFanSpeed(): Not Supported

nvmlDeviceGetFanSpeed(): Not Supported

nvmlDeviceGetFanSpeed(): Not Supported

nvmlDeviceGetFanSpeed(): Not Supported

OpenCL Platform #1: NVIDIA Corporation
======================================
* Device #1: A100-SXM4-40GB, 10134/40537 MB allocatable, 108MCU
* Device #2: A100-SXM4-40GB, 10134/40537 MB allocatable, 108MCU
* Device #3: A100-SXM4-40GB, 10134/40537 MB allocatable, 108MCU
* Device #4: A100-SXM4-40GB, 10134/40537 MB allocatable, 108MCU
* Device #5: A100-SXM4-40GB, 10134/40537 MB allocatable, 108MCU
* Device #6: A100-SXM4-40GB, 10134/40537 MB allocatable, 108MCU
* Device #7: A100-SXM4-40GB, 10134/40537 MB allocatable, 108MCU
* Device #8: A100-SXM4-40GB, 10134/40537 MB allocatable, 108MCU

OpenCL Platform #2: The pocl project
====================================
* Device #9: pthread-Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz, skipped.

Hashes: 1 digests; 1 unique digests, 1 unique salts
Bitmaps: 16 bits, 65536 entries, 0x0000ffff mask, 262144 bytes, 5/13 rotates
Rules: 1

Applicable optimizers:
* Optimized-Kernel
* Zero-Byte
* Precompute-Init
* Precompute-Merkle-Demgard
* Meet-In-The-Middle
* Early-Skip
* Not-Salted
* Not-Iterated
* Single-Hash
* Single-Salt
* Raw-Hash

Password length minimum: 0
Password length maximum: 27

Watchdog: Temperature abort trigger set to 90c
Watchdog: Temperature retain trigger disabled.

* Device #1: build_opts '-I /usr/share/hashcat/OpenCL -D VENDOR_ID=32 -D CUDA_ARCH=800 -D AMD_ROCM=0 -D VECT_SIZE=1 -D DEVICE_TYPE=4 -D DGST_R0=0 -D DGST_R1=3 -D DGST_R2=2 -D DGST_R3=1 -D DGST_ELEM=4 -D KERN_TYPE=1000 -D _unroll'
* Device #2: build_opts '-I /usr/share/hashcat/OpenCL -D VENDOR_ID=32 -D CUDA_ARCH=800 -D AMD_ROCM=0 -D VECT_SIZE=1 -D DEVICE_TYPE=4 -D DGST_R0=0 -D DGST_R1=3 -D DGST_R2=2 -D DGST_R3=1 -D DGST_ELEM=4 -D KERN_TYPE=1000 -D _unroll'
* Device #3: build_opts '-I /usr/share/hashcat/OpenCL -D VENDOR_ID=32 -D CUDA_ARCH=800 -D AMD_ROCM=0 -D VECT_SIZE=1 -D DEVICE_TYPE=4 -D DGST_R0=0 -D DGST_R1=3 -D DGST_R2=2 -D DGST_R3=1 -D DGST_ELEM=4 -D KERN_TYPE=1000 -D _unroll'
* Device #4: build_opts '-I /usr/share/hashcat/OpenCL -D VENDOR_ID=32 -D CUDA_ARCH=800 -D AMD_ROCM=0 -D VECT_SIZE=1 -D DEVICE_TYPE=4 -D DGST_R0=0 -D DGST_R1=3 -D DGST_R2=2 -D DGST_R3=1 -D DGST_ELEM=4 -D KERN_TYPE=1000 -D _unroll'
* Device #5: build_opts '-I /usr/share/hashcat/OpenCL -D VENDOR_ID=32 -D CUDA_ARCH=800 -D AMD_ROCM=0 -D VECT_SIZE=1 -D DEVICE_TYPE=4 -D DGST_R0=0 -D DGST_R1=3 -D DGST_R2=2 -D DGST_R3=1 -D DGST_ELEM=4 -D KERN_TYPE=1000 -D _unroll'
* Device #6: build_opts '-I /usr/share/hashcat/OpenCL -D VENDOR_ID=32 -D CUDA_ARCH=800 -D AMD_ROCM=0 -D VECT_SIZE=1 -D DEVICE_TYPE=4 -D DGST_R0=0 -D DGST_R1=3 -D DGST_R2=2 -D DGST_R3=1 -D DGST_ELEM=4 -D KERN_TYPE=1000 -D _unroll'
* Device #7: build_opts '-I /usr/share/hashcat/OpenCL -D VENDOR_ID=32 -D CUDA_ARCH=800 -D AMD_ROCM=0 -D VECT_SIZE=1 -D DEVICE_TYPE=4 -D DGST_R0=0 -D DGST_R1=3 -D DGST_R2=2 -D DGST_R3=1 -D DGST_ELEM=4 -D KERN_TYPE=1000 -D _unroll'
* Device #8: build_opts '-I /usr/share/hashcat/OpenCL -D VENDOR_ID=32 -D CUDA_ARCH=800 -D AMD_ROCM=0 -D VECT_SIZE=1 -D DEVICE_TYPE=4 -D DGST_R0=0 -D DGST_R1=3 -D DGST_R2=2 -D DGST_R3=1 -D DGST_ELEM=4 -D KERN_TYPE=1000 -D _unroll'
Dictionary cache hit:
* Filename..: wordlists/finalweak.txt
* Passwords.: 15639992272
* Bytes.....: 177992863744
* Keyspace..: 15639992272

- Device #4: autotuned kernel-accel to 256               
- Device #4: autotuned kernel-loops to 1
- Device #3: autotuned kernel-accel to 256               
- Device #3: autotuned kernel-loops to 1
- Device #5: autotuned kernel-accel to 256               
- Device #5: autotuned kernel-loops to 1
- Device #1: autotuned kernel-accel to 256               
- Device #1: autotuned kernel-loops to 1
- Device #2: autotuned kernel-accel to 256               
- Device #2: autotuned kernel-loops to 1
- Device #8: autotuned kernel-accel to 256               
- Device #8: autotuned kernel-loops to 1
- Device #6: autotuned kernel-accel to 256               
- Device #6: autotuned kernel-loops to 1
- Device #7: autotuned kernel-accel to 256               
- Device #7: autotuned kernel-loops to 1

Session..........: hashcat
Status...........: Running
Hash.Type........: NTLM
Hash.Target......:
Time.Started.....: Tue Sep 28 15:32:45 2021 (30 secs)
Time.Estimated...: Tue Sep 28 15:40:02 2021 (6 mins, 47 secs)
Guess.Base.......: File (wordlists/finalweak.txt)
Guess.Queue......: 1/1 (100.00%)
Speed.Dev.#1.....:  4461.3 kH/s (1.92ms)
Speed.Dev.#2.....:  4904.3 kH/s (1.92ms)
Speed.Dev.#3.....:  4222.8 kH/s (1.94ms)
Speed.Dev.#4.....:  4449.5 kH/s (1.95ms)
Speed.Dev.#5.....:  4350.8 kH/s (1.93ms)
Speed.Dev.#6.....:  5675.5 kH/s (1.92ms)
Speed.Dev.#7.....:  4587.4 kH/s (1.92ms)
Speed.Dev.#8.....:  4579.0 kH/s (1.92ms)
Speed.Dev.#*.....: 37230.6 kH/s
Recovered........: 0/1 (0.00%) Digests, 0/1 (0.00%) Salts
Progress.........: 486237586/15639992272 (3.11%)
Rejected.........: 4941202/486237586 (1.02%)
Restore.Point....: 457636015/15639992272 (2.93%)
Candidates.#1....: InFusion121 -> ZRMFY
Candidates.#2....: THEBEATL -> aymhmanaza
Candidates.#3....: PopularZebra8632 -> alex_aria
Candidates.#4....: BNDYQ -> ZRMFIFDY1Ng2iuRl
Candidates.#5....: EJHz -> ZRMFP
Candidates.#6....: 9858457 -> THANHCHUONGPRO
Candidates.#7....: 6zl8doon -> TESIE 1996
Candidates.#8....: MOHsin@ -> THEBEATIFUL
HWMon.Dev.#1.....: Temp: 60c Util:  0% Core:1410MHz Mem:1215MHz Bus:16
HWMon.Dev.#2.....: Temp: 53c Util:  0% Core:1410MHz Mem:1215MHz Bus:16
HWMon.Dev.#3.....: Temp: 65c Util:  0% Core:1410MHz Mem:1215MHz Bus:16
HWMon.Dev.#4.....: Temp: 53c Util:  0% Core:1410MHz Mem:1215MHz Bus:16
HWMon.Dev.#5.....: Temp: 59c Util:  0% Core:1410MHz Mem:1215MHz Bus:16
HWMon.Dev.#6.....: Temp: 52c Util:  0% Core:1410MHz Mem:1215MHz Bus:16
HWMon.Dev.#7.....: Temp: 65c Util:  0% Core:1410MHz Mem:1215MHz Bus:16
HWMon.Dev.#8.....: Temp: 55c Util:  0% Core:1410MHz Mem:1215MHz Bus:16
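
As a sanity check on the status output above, here is a minimal Python sketch that sums the per-device speeds and estimates how long the remaining keyspace would take at that rate. The variable names are my own; the numbers are copied from the Speed.Dev, Progress and Keyspace lines. It only confirms that the low total is what hashcat itself is reporting; it does not explain the cause.

```python
# Per-device speeds in kH/s, copied from the Speed.Dev.#1 .. #8 lines above.
per_device_khs = [4461.3, 4904.3, 4222.8, 4449.5, 4350.8, 5675.5, 4587.4, 4579.0]

# Progress.........: 486237586/15639992272
progress = 486_237_586
keyspace = 15_639_992_272

# Convert kH/s to H/s and add up all eight devices.
total_hs = sum(per_device_khs) * 1_000

# Time left if the combined speed stays constant (a simplifying assumption).
remaining_seconds = (keyspace - progress) / total_hs

print(f"combined speed: {total_hs / 1e6:.1f} MH/s")
print(f"estimated time remaining: {remaining_seconds / 60:.1f} min")
```

The result should land close to the Speed.Dev.#* total and the Time.Estimated line in the output.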

I tried to find a solution by monitoring the GPUs and got this result:
Code:
[0] A100-SXM4-40GB  | 60'C,  0 % | 10860 / 40537 MB | root(10857M)
[1] A100-SXM4-40GB  | 53'C,  0 % | 10860 / 40537 MB | root(10857M)
[2] A100-SXM4-40GB  | 65'C,  0 % | 10860 / 40537 MB | root(10857M)
[3] A100-SXM4-40GB  | 53'C,  0 % | 10860 / 40537 MB | root(10857M)
[4] A100-SXM4-40GB  | 59'C,  0 % | 10860 / 40537 MB | root(10857M)
[5] A100-SXM4-40GB  | 52'C,  0 % | 10860 / 40537 MB | root(10857M)
[6] A100-SXM4-40GB  | 65'C,  0 % | 10860 / 40537 MB | root(10857M)
[7] A100-SXM4-40GB  | 55'C,  41 % | 10860 / 40537 MB | root(10857M)

I am wondering why hashcat only uses 41% of the last GPU instead of using all the GPUs at full power at the same time?

I appreciate any help, thank you :)
Reply
#2
https://hashcat.net/wiki/doku.php?id=fre...ck_so_slow

Specifically the part about creating more work.
Reply
#3
(09-28-2021, 05:42 PM)Xanadrel Wrote: https://hashcat.net/wiki/doku.php?id=fre...ck_so_slow

Specifically the part about creating more work.

Thanks for the resource.
NTLM must be categorized as a fast hash.

I don't see how mask- or rule-based attacks could increase the speed.
Reply
#4
You should probably start by using a version of hashcat that isn't 4 years old. After you update to a more recent version of hashcat and retest, make sure that whatever command you are running doesn't include --force. Once you've got both of those figured out, then you can worry about adding more work or restructuring your attack to achieve a proper workload and better speeds.
Reply
#5
(09-29-2021, 11:19 AM)Chick3nman Wrote: You should probably start by using a version of hashcat that isn't 4 years old. After you update to a more recent version of hashcat and retest, make sure that whatever command you are running doesn't include --force. Once you've got both of those figured out, then you can worry about adding more work or restructuring your attack to achieve a proper workload and better speeds.

Thanks for the reply, Chick3nman.

I used a 60GB wordlist combined with the d3ad0ne rules and got 200 GH/s, which is way, way, way faster than the 37 MH/s I got with the simple wordlist without rules. Do you have any explanation for why it gets faster with rules? That puzzles me.

However, 200 GH/s is still not the 900 GH/s the benchmark led me to expect (even though I know the benchmark runs under the best conditions and optimisations). What can I do to increase the speed further?
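
To put the three figures in this thread on one scale, here is a small Python sketch. The helper `to_hashes_per_second` and the `UNITS` table are made up for illustration and are not part of hashcat; the 200 GH/s value is the rounded figure quoted above.

```python
# Multipliers for the speed suffixes hashcat prints in its status lines.
UNITS = {"H/s": 1, "kH/s": 1e3, "MH/s": 1e6, "GH/s": 1e9, "TH/s": 1e12}


def to_hashes_per_second(value: float, unit: str) -> float:
    """Convert a speed such as (928.2, "GH/s") to plain hashes per second."""
    return value * UNITS[unit]


benchmark = to_hashes_per_second(928.2, "GH/s")        # benchmark total
plain_wordlist = to_hashes_per_second(37230.6, "kH/s")  # wordlist without rules
with_rules = to_hashes_per_second(200, "GH/s")          # wordlist + d3ad0ne rules

print(f"rules vs. plain wordlist: {with_rules / plain_wordlist:,.0f}x faster")
print(f"rules run as share of benchmark: {with_rules / benchmark:.1%}")
print(f"plain wordlist as share of benchmark: {plain_wordlist / benchmark:.4%}")
```

Seen this way, adding rules closed most of the gap already: the run went from a tiny fraction of a percent of the benchmark to roughly a fifth of it.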
Reply
#6
https://hashcat.net/wiki/doku.php?id=fre...full_speed
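Short answer:
hashcat has a base loop (the words from the wordlist) and a mod loop (the rules). The rules are applied on the GPU, so each base word turns into many candidates there, resulting in higher hash rates (the rules act as an amplifier).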
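
A rough way to picture the amplifier effect is the toy model below, in Python. It is my own simplification, not how hashcat schedules work internally: it assumes the host delivers base words at a fixed rate, each base word is expanded into `amplifier` candidates on the GPU, and the total is capped at the benchmark speed. The function names are hypothetical; the feed rate and peak are the figures posted earlier in this thread.

```python
import math


def effective_speed(feed_rate_hs: float, amplifier: int, gpu_peak_hs: float) -> float:
    """Idealized speed: base words per second times candidates per word, capped at GPU peak."""
    return min(feed_rate_hs * amplifier, gpu_peak_hs)


def amplifier_to_saturate(feed_rate_hs: float, gpu_peak_hs: float) -> int:
    """Smallest number of candidates per base word that reaches the GPU peak in this model."""
    return math.ceil(gpu_peak_hs / feed_rate_hs)


# Assumption: the plain-wordlist speed from post #1 is the base-word feed rate.
FEED_RATE_HS = 37_230_600
# Benchmark total from post #1 (928.2 GH/s), taken as the peak of all eight GPUs.
GPU_PEAK_HS = 928.2e9

for rules in (1, 100, 10_000, 100_000):
    speed = effective_speed(FEED_RATE_HS, rules, GPU_PEAK_HS)
    print(f"{rules:>7} rules -> {speed / 1e9:8.2f} GH/s")

print("rules needed to saturate:", amplifier_to_saturate(FEED_RATE_HS, GPU_PEAK_HS))
```

Treat the output as an upper bound only. Real runs fall short of it, as the 200 GH/s reported above shows, because rejected candidates, rule cost and host-side overhead are all ignored here.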
Reply