Strange Electrum Wallet performance on different GPUs
#1
The 2070S has the same speed as the 1080 Ti on MD5, but on mode 21700 the 2070S is more than twice as fast as the 1080 Ti.
The Tesla T4 is much slower than the 1080 Ti on MD5, but on mode 21700 the Tesla T4 is much faster than the 1080 Ti.

Does anyone know why? Is it because of a larger L2 cache?

--------------------------------------------------------------

Options:
- Hashcat version: 6.1.1
- Hashcat options: -b --benchmark-all -O -w 4 (i.e. complete benchmark)


# Hashmode: 0 - MD5
1080ti    Speed.#1.........: 36152.2 MH/s (51.85ms) @ Accel:64 Loops:1024 Thr:1024 Vec:8
2070s    Speed.#1.........: 35970.0 MH/s (74.49ms) @ Accel:64 Loops:1024 Thr:1024 Vec:8
2080ti    Speed.#1.........: 53975.3 MH/s (42.16ms) @ Accel:32 Loops:1024 Thr:1024 Vec:8
tesla-t4  Speed.#1.........: 20213.3 MH/s (132.27ms) @ Accel:64 Loops:1024 Thr:1024 Vec:1


# Hashmode: 21700 - Electrum Wallet (Salt-Type 4) (Iterations: 1023)
1080ti    Speed.#1.........:  208.1 kH/s (367.00ms) @ Accel:8 Loops:1023 Thr:1024 Vec:1
2070s    Speed.#1.........:  464.2 kH/s (480.54ms) @ Accel:8 Loops:1023 Thr:1024 Vec:1
2080ti    Speed.#1.........:  688.3 kH/s (273.79ms) @ Accel:4 Loops:1023 Thr:1024 Vec:1
tesla-t4  Speed.#1.........:  294.1 kH/s (266.28ms) @ Accel:8 Loops:511 Thr:1024 Vec:1


Reference:
https://www.onlinehashcrack.com/tools-be...080-ti.php
https://www.onlinehashcrack.com/tools-be...-super.php
https://www.onlinehashcrack.com/tools-be...080-ti.php
https://www.onlinehashcrack.com/tools-be...sla-t4.php
#2
After searching for a few days, I think I have found something.
Maxwell and Pascal GPUs use a different implementation with separate shared memory but combined L1 cache and texture cache.

Turing’s SM introduces a new unified architecture for shared memory, L1, and texture caching. This unified design allows the L1 cache to leverage resources, increasing its hit bandwidth by 2x per TPC compared to Pascal, and allows it to be reconfigured to grow larger when shared memory allocations are not using all the shared memory capacity. The Turing L1 can be as large as 64 KB in size, combined with a 32 KB per SM shared memory allocation, or it can reduce to 32 KB, allowing 64 KB of allocation to be used for shared memory.
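On Turing and newer GPUs this L1/shared-memory split can be hinted per kernel from the host side via the preferred shared-memory carveout attribute. Below is a minimal CUDA sketch of how that hint is expressed; the kernel and its workload are placeholders I made up for illustration, not hashcat code.

Code:
#include <cuda_runtime.h>

// Placeholder kernel: stands in for any compute-heavy kernel.
__global__ void demo_kernel(const unsigned int *in, unsigned int *out)
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i] ^ 0x5c5c5c5cU;
}

int main()
{
    unsigned int *d_in, *d_out;
    cudaMalloc((void **) &d_in,  256 * sizeof(unsigned int));
    cudaMalloc((void **) &d_out, 256 * sizeof(unsigned int));
    cudaMemset(d_in, 0, 256 * sizeof(unsigned int));

    // Ask the driver to favour shared memory in the unified on-chip block
    // (use cudaSharedmemCarveoutMaxL1 to favour L1 instead). This is only
    // a hint; the driver may round to a supported split.
    cudaFuncSetAttribute(demo_kernel,
                         cudaFuncAttributePreferredSharedMemoryCarveout,
                         cudaSharedmemCarveoutMaxShared);

    demo_kernel<<<1, 256>>>(d_in, d_out);
    cudaDeviceSynchronize();

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}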

Maybe the secp256k1 computation is cache sensitive. I tried putting the secp256k1_t struct in shared memory, and the point_mul_xy calculation seems to be about twice as fast as before. I am a newbie at CUDA development; maybe I have found the reason?
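To illustrate what I mean by putting the table in shared memory, here is a minimal CUDA sketch of the pattern: each thread block copies the precomputed table into shared memory once, and all later lookups hit on-chip memory instead of L1/L2/global. The precomp_t struct, the N_PRECOMP size, and the dummy XOR loop are hypothetical stand-ins, not hashcat's real secp256k1_t layout or point_mul_xy() code.

Code:
#include <cuda_runtime.h>
#include <stdint.h>

#define N_PRECOMP 96  // hypothetical table size in 32-bit words

typedef struct
{
    uint32_t xy[N_PRECOMP];  // precomputed curve points, flattened
} precomp_t;

__global__ void point_mul_demo(const precomp_t *g_tbl, uint32_t *out)
{
    // One copy of the table per thread block, kept in on-chip shared memory.
    __shared__ precomp_t s_tbl;

    // Cooperative copy: each thread loads a slice of the table.
    for (int i = threadIdx.x; i < N_PRECOMP; i += blockDim.x)
    {
        s_tbl.xy[i] = g_tbl->xy[i];
    }

    __syncthreads();  // the table must be complete before anyone reads it

    // Dummy workload that repeatedly reads the shared table; the real work
    // would be the windowed point multiplication.
    uint32_t acc = 0;
    for (int i = 0; i < N_PRECOMP; i++) acc ^= s_tbl.xy[i];

    out[blockIdx.x * blockDim.x + threadIdx.x] = acc;
}

int main()
{
    precomp_t h_tbl = {0};  // zero-filled table, just for the demo
    precomp_t *d_tbl;
    uint32_t  *d_out;
    cudaMalloc((void **) &d_tbl, sizeof(precomp_t));
    cudaMalloc((void **) &d_out, 256 * sizeof(uint32_t));
    cudaMemcpy(d_tbl, &h_tbl, sizeof(precomp_t), cudaMemcpyHostToDevice);

    point_mul_demo<<<1, 256>>>(d_tbl, d_out);
    cudaDeviceSynchronize();

    cudaFree(d_tbl);
    cudaFree(d_out);
    return 0;
}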