hashcat Forum
Is using CUDA runtime always prefer over OpenCL ??



Is using CUDA runtime always prefer over OpenCL ?? - Gyfer - 06-25-2023

Everywhere I read about hashcat, the advice is to choose CUDA over OpenCL because CUDA is faster.
But my results show otherwise.

What are your CUDA and OpenCL speeds?

Running OpenCL:
Code:
user@compute2:$ hashcat -b -m 22000 -d 4
hashcat (v6.2.5) starting in benchmark mode

CUDA API (CUDA 11.4)
====================
* Device #1: NVIDIA GeForce GTX 1660 SUPER, skipped

OpenCL API (OpenCL 3.0 CUDA 11.4.364) - Platform #3 [NVIDIA Corporation]
========================================================================
* Device #4: NVIDIA GeForce GTX 1660 SUPER, 5824/5944 MB (1486 MB allocatable), 22MCU

Benchmark relevant options:
===========================
* --backend-devices=4
* --optimized-kernel-enable

-------------------------------------------------------------
* Hash-Mode 22000 (WPA-PBKDF2-PMKID+EAPOL) [Iterations: 4095]
-------------------------------------------------------------

Speed.#4...test1......:  303.6 kH/s (73.45ms) @ Accel:64 Loops:256 Thr:256 Vec:1
Speed.#4...test2......:  303.9 kH/s (73.54ms) @ Accel:128 Loops:128 Thr:256 Vec:1
Speed.#4...test3......:  302.7 kH/s (73.85ms) @ Accel:128 Loops:128 Thr:256 Vec:1

Started: Sun Jun 25 07:22:57 2023
Stopped: Sun Jun 25 07:23:33 2023
user@compute2:~$

Running CUDA:
Code:
user@compute2:$ hashcat -b -m 22000 -d 1
hashcat (v6.2.5) starting in benchmark mode

CUDA API (CUDA 11.4)
====================
* Device #1: NVIDIA GeForce GTX 1660 SUPER, 5872/5944 MB, 22MCU

OpenCL API (OpenCL 3.0 CUDA 11.4.364) - Platform #3 [NVIDIA Corporation]
========================================================================
* Device #4: NVIDIA GeForce GTX 1660 SUPER, skipped

Benchmark relevant options:
===========================
* --backend-devices=1
* --optimized-kernel-enable

-------------------------------------------------------------
* Hash-Mode 22000 (WPA-PBKDF2-PMKID+EAPOL) [Iterations: 4095]
-------------------------------------------------------------

Speed.#1..test1..:  300.5 kH/s (73.41ms) @ Accel:32 Loops:256 Thr:512 Vec:1
Speed.#1..test2..:  301.2 kH/s (73.24ms) @ Accel:32 Loops:256 Thr:512 Vec:1
Speed.#1..test3..:  300.0 kH/s (73.18ms) @ Accel:16 Loops:512 Thr:512 Vec:1

Started: Sun Jun 25 07:31:32 2023
Stopped: Sun Jun 25 07:31:37 2023
user@compute2:~$

That is a difference of about 1% in favor of OpenCL, which I think is quite significant.
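
To put a number on that gap, here is a minimal Python sketch using only the six speeds from the two runs above. The helper names (`mean`, `relative_gain`, `spread_pct`) are made up for illustration. It also prints the run-to-run spread within each backend, which gives a rough idea of how much of the gap could be noise; three runs per backend is too few to settle that.

```python
# Benchmark speeds in kH/s, copied from the two runs above (GTX 1660 SUPER, mode 22000).
opencl = [303.6, 303.9, 302.7]
cuda = [300.5, 301.2, 300.0]


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


def relative_gain(a, b):
    """Percentage by which a exceeds b."""
    return (a - b) / b * 100.0


def spread_pct(values):
    """Difference between best and worst run, as a percentage of the mean."""
    return (max(values) - min(values)) / mean(values) * 100.0


opencl_avg = mean(opencl)
cuda_avg = mean(cuda)
gain = relative_gain(opencl_avg, cuda_avg)

print(f"OpenCL average: {opencl_avg:.1f} kH/s (spread {spread_pct(opencl):.2f}%)")
print(f"CUDA average:   {cuda_avg:.1f} kH/s (spread {spread_pct(cuda):.2f}%)")
print(f"OpenCL vs CUDA: {gain:+.2f}%")
```

With these numbers the gap comes out slightly under 1%, while the spread inside each backend is roughly 0.4%.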


RE: Is using CUDA runtime always prefer over OpenCL ?? - marc1n - 06-25-2023

Your test must be performed on a real hash, not in benchmark mode. Speeds also depend on the hash type.


RE: Is using CUDA runtime always prefer over OpenCL ?? - lapsikmees - 06-25-2023

Also, you don't have the latest hashcat and CUDA.


RE: Is using CUDA runtime always prefer over OpenCL ?? - lapsikmees - 06-25-2023

(06-25-2023, 10:46 PM)aikiuslik Wrote: Also, you don't have the latest hashcat and CUDA.
Code:
CUDA API (CUDA 12.2)
====================
* Device #1: NVIDIA GeForce RTX 3060, 11261/12287 MB, 28MCU

OpenCL API (OpenCL 3.0 CUDA 12.2.79) - Platform #1 [NVIDIA Corporation]
=======================================================================
* Device #2: NVIDIA GeForce RTX 3060, skipped

Benchmark relevant options:
===========================
* --optimized-kernel-enable

-------------------------------------------------------------
* Hash-Mode 22000 (WPA-PBKDF2-PMKID+EAPOL) [Iterations: 4095]
-------------------------------------------------------------

Speed.#1.........:  396.8 kH/s (70.27ms) @ Accel:8 Loops:1024 Thr:512 Vec:1

Started: Sun Jun 25 23:49:08 2023
Stopped: Sun Jun 25 23:49:24 2023

CUDA API (CUDA 12.2)
====================
* Device #1: NVIDIA GeForce RTX 3060, skipped

OpenCL API (OpenCL 3.0 CUDA 12.2.79) - Platform #1 [NVIDIA Corporation]
=======================================================================
* Device #2: NVIDIA GeForce RTX 3060, 12160/12287 MB (3071 MB allocatable), 28MCU

Benchmark relevant options:
===========================
* --backend-devices=2
* --optimized-kernel-enable

-------------------------------------------------------------
* Hash-Mode 22000 (WPA-PBKDF2-PMKID+EAPOL) [Iterations: 4095]
-------------------------------------------------------------

Speed.#2.........:  396.5 kH/s (71.25ms) @ Accel:32 Loops:512 Thr:256 Vec:1

Started: Sun Jun 25 23:50:04 2023
Stopped: Sun Jun 25 23:50:20 2023



RE: Is using CUDA runtime always prefer over OpenCL ?? - Gyfer - 06-27-2023

(06-25-2023, 10:51 PM)aikiuslik Wrote: Also, you don't have the latest hashcat and CUDA.

According to the changes.txt file in the hashcat documentation on GitHub, there don't appear to be any improvements or additional benefits in upgrading to version 6.2.6 if the algorithm I primarily use is 22000. So I see no need to switch to 6.2.6 while version 6.2.5 is working well for me.

As for CUDA: according to the NVIDIA CUDA documentation and the NVIDIA Driver Archive for Unix, since I have an NVIDIA GT 730 graphics card in my PC (for backward compatibility reasons), the latest NVIDIA driver available to me on Linux is version 470.xx. The latest CUDA version compatible with that driver is CUDA 11.4; newer CUDA versions require newer driver versions.

Based on the above, I believe I have the latest "compatible" versions of Hashcat and CUDA for my setup. :-)

Now I am having trouble getting an RX580 to compute in hashcat. I won't be trying ROCm since AMD dropped support for it in ROCm 4.0. Any pointers for getting it to run with OpenCL on Ubuntu 20.04 Focal Fossa?

Over here, an NVIDIA GTX 1060 still goes for around USD 75 (forex rate: 4.66), and an RX580 4GB can be had for around USD 43.
The GTX 1060 reaches approximately 186 kH/s on mode 22000 running CUDA 11.x (ref).
For the RX580, I believe I can get around 220 kH/s running OpenCL on Windows.
Getting two RX580s for less than USD 100 and a combined hashrate of over 400 kH/s is not bad, I think.
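
The price/performance reasoning above can be sketched in a few lines of Python. The prices and speeds are the ones quoted in this post; the RX580 figure is only my estimate, and the two-card total assumes the hashrate scales linearly with the number of cards, which is an assumption rather than a measurement.

```python
# Prices (USD) and mode 22000 speeds (kH/s) as quoted in the post.
# The RX580 speed is an estimate, not a measured result.
cards = {
    "GTX 1060": {"price_usd": 75, "khs": 186},
    "RX580 4GB": {"price_usd": 43, "khs": 220},
}


def khs_per_usd(card):
    """Hashrate bought per dollar spent."""
    return card["khs"] / card["price_usd"]


for name, card in cards.items():
    print(f"{name}: {khs_per_usd(card):.2f} kH/s per USD")

# Two RX580s, assuming linear scaling across cards.
pair_price = 2 * cards["RX580 4GB"]["price_usd"]
pair_speed = 2 * cards["RX580 4GB"]["khs"]
print(f"2x RX580: USD {pair_price} for about {pair_speed} kH/s")
```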

AMD drivers tried (or still trying):
1. https://github.com/RadeonOpenCompute/ROCm
2. https://github.com/RadeonOpenCompute/ROCm-OpenCL-Runtime
3. https://github.com/xuhuisheng/rocm-gfx803
4. https://www.videogames.ai/Install-ROCM-Machine-Learning-AMD-GPU
5. https://github.com/ptitjes/opencl-amd
I even resorted to a CMake build:
1. CMake 3.20-custom
2. OpenCL in Linux

I have spent 4 days trying to get AMD to hash! *meow*


RE: Is using CUDA runtime always prefer over OpenCL ?? - lapsikmees - 06-27-2023

Why don't you use Windows with CUDA? Much easier.


RE: Is using CUDA runtime always prefer over OpenCL ?? - Gyfer - 06-28-2023

(06-27-2023, 10:17 AM)aikiuslik Wrote: Why don't you use Windows with CUDA? Much easier.

Too much overhead, and not headless-friendly. At one point I did dump Windows for Ubuntu because of Docker, until I found WSL2.

Anyhow, here is the follow-up on CUDA vs OpenCL:
The P4000 seems to run a lot slower in OpenCL; other than that, OpenCL shows improvements on the GTX 1660 Super and GT 730.

Here is the benchmark. I did a few runs and picked the best speeds among them.

Code:
hashcat (v6.2.5) starting in benchmark mode
* Device #7: This hardware has outdated CUDA compute capability (3.5).
            For modern OpenCL performance, upgrade to hardware that supports
            CUDA compute capability version 5.0 (Maxwell) or higher.

CUDA API (CUDA 11.4)
====================
* Device #1: NVIDIA GeForce GTX 1660 SUPER, skipped
* Device #2: Quadro P4000, 8038/8119 MB, 14MCU
* Device #3: NVIDIA GeForce GT 730, skipped

OpenCL API (OpenCL 3.0 PoCL 3.0-rc2  Linux, RelWithDebInfo, RELOC, SPIR, LLVM 10.0.0, SLEEF, POCL_DEBUG) - Platform #1 [The pocl project]
=========================================================================================================================================
* Device #4: pthread-Intel(R) Core(TM) i3-4130 CPU @ 3.40GHz, skipped

OpenCL API (OpenCL 2.0 AMD-APP (3314.0)) - Platform #2 [Advanced Micro Devices, Inc.]
=====================================================================================

OpenCL API (OpenCL 3.0 CUDA 11.4.402) - Platform #3 [NVIDIA Corporation]
========================================================================
* Device #5: NVIDIA GeForce GTX 1660 SUPER, 5824/5944 MB (1486 MB allocatable), 22MCU
* Device #6: Quadro P4000, skipped
* Device #7: NVIDIA GeForce GT 730, 1920/2002 MB (500 MB allocatable), 2MCU

Benchmark relevant options:
===========================
* --backend-devices=2,5,7
* --optimized-kernel-enable

-------------------------------------------------------------
* Hash-Mode 22000 (WPA-PBKDF2-PMKID+EAPOL) [Iterations: 4095]
-------------------------------------------------------------

Speed.#1.........:  300.0 kH/s (73.88ms) @ Accel:8 Loops:1024 Thr:512 Vec:1
Speed.#2.........:  283.2 kH/s (49.96ms) @ Accel:16 Loops:512 Thr:512 Vec:1
Speed.#3.........:    11150 H/s (80.96ms) @ Accel:8 Loops:1024 Thr:256 Vec:1
Speed.#*.........:  594.4 kH/s

Speed.#1.........:  300.2 kH/s (73.82ms) @ Accel:64 Loops:256 Thr:256 Vec:1
Speed.#2.........:  280.8 kH/s (50.35ms) @ Accel:32 Loops:256 Thr:512 Vec:1
Speed.#3.........:    12257 H/s (77.74ms) @ Accel:16 Loops:512 Thr:256 Vec:1
Speed.#*.........:  593.3 kH/s

Speed.#1.........:  299.2 kH/s (74.07ms) @ Accel:64 Loops:256 Thr:256 Vec:1
Speed.#2.........:  285.2 kH/s (49.55ms) @ Accel:8 Loops:1024 Thr:512 Vec:1
Speed.#3.........:    11286 H/s (79.98ms) @ Accel:32 Loops:1024 Thr:64 Vec:1
Speed.#*.........:  595.7 kH/s


Speed.#5.........:  302.4 kH/s (74.11ms) @ Accel:128 Loops:128 Thr:256 Vec:1
Speed.#6.........:  262.5 kH/s (54.25ms) @ Accel:128 Loops:128 Thr:256 Vec:1
Speed.#7.........:    12131 H/s (79.19ms) @ Accel:8 Loops:1024 Thr:256 Vec:1
Speed.#*.........:  577.0 kH/s

Speed.#5.........:  302.3 kH/s (74.14ms) @ Accel:128 Loops:128 Thr:256 Vec:1
Speed.#6.........:  261.4 kH/s (54.47ms) @ Accel:128 Loops:128 Thr:256 Vec:1
Speed.#7.........:    13573 H/s (74.37ms) @ Accel:64 Loops:128 Thr:256 Vec:1
Speed.#*.........:  577.2 kH/s

Speed.#5.........:  302.1 kH/s (74.17ms) @ Accel:128 Loops:128 Thr:256 Vec:1
Speed.#6.........:  262.1 kH/s (54.33ms) @ Accel:128 Loops:128 Thr:256 Vec:1
Speed.#7.........:    13458 H/s (74.41ms) @ Accel:32 Loops:256 Thr:256 Vec:1
Speed.#*.........:  577.7 kH/s



Speed.#5.........:  302.3 kH/s (74.13ms) @ Accel:128 Loops:128 Thr:256 Vec:1
Speed.#2.........:  288.2 kH/s (49.22ms) @ Accel:8 Loops:1024 Thr:512 Vec:1
Speed.#7.........:    12125 H/s (79.22ms) @ Accel:8 Loops:1024 Thr:256 Vec:1
Speed.#*.........:  602.6 kH/s

Speed.#5.........:  302.1 kH/s (74.18ms) @ Accel:64 Loops:256 Thr:256 Vec:1
Speed.#2.........:  286.6 kH/s (49.50ms) @ Accel:8 Loops:1024 Thr:512 Vec:1
Speed.#7.........:    13458 H/s (74.43ms) @ Accel:32 Loops:256 Thr:256 Vec:1
Speed.#*.........:  602.2 kH/s

Speed.#5.........:  302.3 kH/s (74.13ms) @ Accel:128 Loops:128 Thr:256 Vec:1
Speed.#2.........:  286.9 kH/s (49.48ms) @ Accel:8 Loops:1024 Thr:512 Vec:1
Speed.#7.........:    13570 H/s (74.56ms) @ Accel:128 Loops:64 Thr:256 Vec:1
Speed.#*.........:  602.7 kH/s
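
For anyone who wants to compare the backends per card without reading the numbers by eye, here is a rough Python sketch. It parses the `Speed.#N` lines from the all-CUDA runs (devices 1, 2, 3) and the all-OpenCL runs (devices 5, 6, 7) above, keeps the best speed per device, and prints the OpenCL-vs-CUDA difference. The device-to-card mapping comes from the device listing in the log; the regular expression and the helper names are my own and only tested against this output format. The mixed runs at the end of the log are left out.

```python
import re

# Speed lines copied from the all-CUDA and all-OpenCL runs above.
LOG = """\
Speed.#1.........:  300.0 kH/s (73.88ms) @ Accel:8 Loops:1024 Thr:512 Vec:1
Speed.#2.........:  283.2 kH/s (49.96ms) @ Accel:16 Loops:512 Thr:512 Vec:1
Speed.#3.........:    11150 H/s (80.96ms) @ Accel:8 Loops:1024 Thr:256 Vec:1
Speed.#1.........:  300.2 kH/s (73.82ms) @ Accel:64 Loops:256 Thr:256 Vec:1
Speed.#2.........:  280.8 kH/s (50.35ms) @ Accel:32 Loops:256 Thr:512 Vec:1
Speed.#3.........:    12257 H/s (77.74ms) @ Accel:16 Loops:512 Thr:256 Vec:1
Speed.#1.........:  299.2 kH/s (74.07ms) @ Accel:64 Loops:256 Thr:256 Vec:1
Speed.#2.........:  285.2 kH/s (49.55ms) @ Accel:8 Loops:1024 Thr:512 Vec:1
Speed.#3.........:    11286 H/s (79.98ms) @ Accel:32 Loops:1024 Thr:64 Vec:1
Speed.#5.........:  302.4 kH/s (74.11ms) @ Accel:128 Loops:128 Thr:256 Vec:1
Speed.#6.........:  262.5 kH/s (54.25ms) @ Accel:128 Loops:128 Thr:256 Vec:1
Speed.#7.........:    12131 H/s (79.19ms) @ Accel:8 Loops:1024 Thr:256 Vec:1
Speed.#5.........:  302.3 kH/s (74.14ms) @ Accel:128 Loops:128 Thr:256 Vec:1
Speed.#6.........:  261.4 kH/s (54.47ms) @ Accel:128 Loops:128 Thr:256 Vec:1
Speed.#7.........:    13573 H/s (74.37ms) @ Accel:64 Loops:128 Thr:256 Vec:1
Speed.#5.........:  302.1 kH/s (74.17ms) @ Accel:128 Loops:128 Thr:256 Vec:1
Speed.#6.........:  262.1 kH/s (54.33ms) @ Accel:128 Loops:128 Thr:256 Vec:1
Speed.#7.........:    13458 H/s (74.41ms) @ Accel:32 Loops:256 Thr:256 Vec:1
"""

# Device number -> (card, backend), taken from the device listing in the log.
DEVICES = {
    1: ("GTX 1660 SUPER", "CUDA"),
    2: ("Quadro P4000", "CUDA"),
    3: ("GT 730", "CUDA"),
    5: ("GTX 1660 SUPER", "OpenCL"),
    6: ("Quadro P4000", "OpenCL"),
    7: ("GT 730", "OpenCL"),
}

# Matches per-device lines only; the "Speed.#*" total has no digit and is skipped.
SPEED_RE = re.compile(r"Speed\.#(\d+)\.*:\s*([\d.]+)\s*(kH/s|H/s)")


def best_speeds(log):
    """Return the best speed in kH/s seen for each device number."""
    best = {}
    for match in SPEED_RE.finditer(log):
        device = int(match.group(1))
        speed = float(match.group(2))
        if match.group(3) == "H/s":
            speed /= 1000.0  # normalise H/s to kH/s
        best[device] = max(best.get(device, 0.0), speed)
    return best


def opencl_vs_cuda(best):
    """Percentage difference of OpenCL relative to CUDA, per card."""
    by_card = {}
    for device, speed in best.items():
        card, backend = DEVICES[device]
        by_card.setdefault(card, {})[backend] = speed
    return {
        card: (s["OpenCL"] - s["CUDA"]) / s["CUDA"] * 100.0
        for card, s in by_card.items()
    }


best = best_speeds(LOG)
diffs = opencl_vs_cuda(best)
for card, pct in diffs.items():
    print(f"{card}: OpenCL is {pct:+.1f}% relative to CUDA")
```

On these numbers the P4000 loses roughly 8% under OpenCL, the GT 730 gains roughly 11%, and the GTX 1660 SUPER gains under 1%.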