I'm just afraid the CPU would become the bottleneck here. My unoptimised password generation algorithm does around 5-10 M/s per core on my laptop, so I would need a 10-20x speedup to keep up with a single GPU, assuming 4 beefy CPU cores.
Bandwidth also looked like a problem to me: if the passwords are 20 chars long, 20 bytes * 800e6 per second is 16 GB/s. GPU buses are slower than that, right? A quick search seems to show a single PCIe 2.0 16x bus is less than 4 GB/s.
Are you suggesting sending combinations only, keeping the dictionaries in the GPU's RAM? Maybe that way I could reduce the traffic to, say, 8 bytes per password, but I would still be way off. Why not do the password generation inside the GPU when you get to that stage anyway?
For the record, these are the PCI-E speeds for each revision:
Speed for single-lane (×1) and 16-lane (×16) links, in each direction:
v. 1.x (2.5 GT/s): 250 MB/s (×1), 4 GB/s (×16)
v. 2.x (5 GT/s): 500 MB/s (×1), 8 GB/s (×16)
v. 3.x (8 GT/s): 985 MB/s (×1), 15.75 GB/s (×16)
v. 4.0 (16 GT/s): 1.969 GB/s (×1), 31.51 GB/s (×16)
v. 5.0 (32 GT/s): 3.9 GB/s (×1), 63 GB/s (×16)
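To put the bandwidth worry into numbers, here is a rough sketch that compares the two traffic estimates from the earlier post (20 bytes or 8 bytes per candidate at 800e6 candidates per second) against the ×16 figures listed above. The function names are made up for illustration, and the link speeds are theoretical peaks, so real transfers will be somewhat slower.

```python
# PCIe x16 per-direction peak throughput in GB/s, copied from the list above.
PCIE_X16_GBPS = {"1.x": 4.0, "2.x": 8.0, "3.x": 15.75, "4.0": 31.51, "5.0": 63.0}


def required_gbps(candidates_per_second, bytes_per_candidate):
    """Host-to-GPU traffic needed to feed the GPU, in GB/s."""
    return candidates_per_second * bytes_per_candidate / 1e9


def links_that_keep_up(need_gbps):
    """PCIe revisions whose x16 peak rate is at least the required rate."""
    return [rev for rev, gbps in PCIE_X16_GBPS.items() if gbps >= need_gbps]


RATE = 800e6  # candidates per second, the figure used in the post above

full_passwords = required_gbps(RATE, 20)  # sending 20-char passwords
compact_indices = required_gbps(RATE, 8)  # sending 8-byte combination indices

print(f"20 B/candidate: {full_passwords:.1f} GB/s -> {links_that_keep_up(full_passwords)}")
print(f" 8 B/candidate: {compact_indices:.1f} GB/s -> {links_that_keep_up(compact_indices)}")
```

On paper, full 20-byte candidates only fit on PCIe 4.0 ×16 or newer, while 8-byte indices would already fit on PCIe 2.x ×16. That still leaves the CPU generation rate as the other limit.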
I was suggesting you do all the combination/modification on your CPU and pipe it into hashcat, which I believe is how you understood it initially. This is pretty common and the reduction in speed will not be very noticeable for your use case. It will still be faster than running it on just your CPU, even if your CPU bottlenecks you for some reason.
That is reasonable. However, I would still need to optimize my CPU generation code 10-20x, otherwise I would need to pair expensive CPUs with a single GPU. Sure, I can still get 10x the performance, but for 2x the hardware cost.
01-13-2018, 11:29 PM (This post was last modified: 01-14-2018, 03:32 AM by uaioaqo.)
I started gluing those functions together today. (I got my intro to OpenCL, and I have to say it was interesting.)
I'm sort of getting the performance I was expecting, except for the AES256 key inversion function, which is making everything else completely useless. I get ~70 M/s on my laptop's Intel 6200 with the AES key inversion function commented out, and about one eighth of that (9 M/s) with that function enabled. Here is the function:
It doesn't look like anything special to me, so I don't know if I'm doing something wrong with my code, misusing the memory types, etc. (I tried setting the ks parameter, which gets updated a lot in the key inversion function, to __local, but it wouldn't let me compile; I'm assuming the compiler is already placing it in private memory anyway.) Or maybe the function is just slow and I need to live with it.
Edit: would it make sense that, when testing the same code on a 1050, I get good performance with no gigantic bottleneck there, only a drop to 1/3 of the performance when enabling AES? I'm now seeing 200 M/s on a 1050 (I fixed the bug where the compiler was optimising out some functions because I'm terrible at programming).
01-14-2018, 09:45 AM (This post was last modified: 01-14-2018, 09:46 AM by philsmd.)
If I'm not totally mistaken and if the algorithm is really just aes_decrypt (sha256 (sha256 ($pass)), $data) ... then your strategy to implement this kernel doesn't make much sense.
The main question that a dev always needs to ask himself before trying to implement a hashcat OpenCL kernel is: is it a slow or fast hash?
Only (very) slow hashes need to have _init (), _loop () and _comp () kernel functions.
In my opinion the main problem in your attempt to implement it is that you made the wrong first choice about how to implement this kernel (good thing is: it's actually easy to fix!). If it only uses 2 sha256 iterations (and one compare after the final aes decrypt step), it might be waaaaaayyy too fast to justify a slow-hash-kernel.
init/loop/comp are normally only used when there are a lot of iterations (in most cases even custom/arbitrary/flexible amounts of iterations). It seems that this is not true for this specific algorithm.
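A toy sketch of that distinction, in Python rather than OpenCL. The function names and the iterated construction are hypothetical and only mirror the shape of the split by analogy; this is not hashcat code. The two-pass derivation follows the algorithm as described above; the AES decrypt and compare step is left out.

```python
import hashlib


def double_sha256(password: bytes) -> bytes:
    """Fast-hash shape: a fixed, tiny amount of work per candidate.

    Two SHA-256 passes yield a 32-byte value that could serve as an
    AES-256 key. Everything fits comfortably in one kernel invocation.
    """
    return hashlib.sha256(hashlib.sha256(password).digest()).digest()


# Slow-hash shape (toy example): the state survives between calls so that
# a long iteration count can be processed in chunks.
def slow_init(password: bytes, salt: bytes) -> bytes:
    return hashlib.sha256(salt + password).digest()


def slow_loop(state: bytes, iterations: int) -> bytes:
    for _ in range(iterations):
        state = hashlib.sha256(state).digest()
    return state


def slow_comp(state: bytes, target: bytes) -> bool:
    return state == target


key = double_sha256(b"hunter2")
print(len(key), "byte key from 2 iterations")

# The loop step can be split into chunks without changing the result.
state = slow_init(b"hunter2", b"salt")
chunked = slow_loop(slow_loop(state, 600), 400)
print(slow_comp(chunked, slow_loop(state, 1000)))
```

With only two iterations there is nothing to chunk, which is why the init/loop/comp split adds overhead without buying anything here.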
You could/should look into how to implement _a0, _a1 and _a3 kernels (which are used for the specific attack modes -a x). Of course, you could start by implementing only the kernel that you need urgently and implement the remaining kernels afterwards (e.g. if you want to use a dictionary attack, you could implement the _a0 kernel first).
It's good that the implementation is actually very easy because it just needs a couple of copy-paste (since all the code blocks are already there within other kernels).
I'm not sure what you mean by these defines for the opencl compiler... we normally just set some variables within src/interface.c and therefore the host code lets the kernel know which defines are needed.... no additional/manual defines are needed in general.
I also think that it would make sense to open a github issue for this (for now: just to ask for support for this algorithm)... and maybe discuss the algorithm details and technical problems you are experiencing while implementing it over there.
01-14-2018, 11:33 AM (This post was last modified: 01-14-2018, 11:35 AM by uaioaqo.)
Yes, I actually thought of that before! I figured I wanted to have a single kernel doing password generation and testing, multiple passwords at a time... I kept the legacy m13400_init() naming only because I was copying that function and haven't changed it yet.
The reason I'm also planning on doing password generation in the kernel is the same reason I'm passing those defines: I took everything apart and implemented this as a standalone program (and I needed to do that so that I could implement my own password generation rules without understanding hashcat). And yes, copying those existing blocks and getting this sort of performance is amazing. It's almost a paradox: optimising the SHA/AES CPU instructions was way more complicated.
I will share the opencl kernel here (and the main standalone program), just in case anyone needs it in the future. I don't think this deserves a github issue but I can do that. I'm just a bit reluctant to share this because a bug in Electrum was discovered just a few days ago that allowed any website to read your encrypted wallet seed if you had Electrum and the webpage running at the same time... That is kind of crazy... And maybe sharing a super-fast Electrum cracker is not a good idea, at least for a couple of months.
01-18-2018, 01:49 PM (This post was last modified: 01-18-2018, 02:32 PM by uaioaqo.)
I've spent the better part of the last few days trying to optimise this AES thingy, and it's killing me, so I'm giving up.
My final figures are ~290 M/s for 2x SHA256 on a 1050 Ti (the benchmark for a single SHA256 shows ~900 M/s, so I should get ~450 M/s; I attribute the performance drop to me not using the "optimised" version of SHA256 hashcat ships with) and ~70 M/s with 2x SHA256 + AES.
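One way to read those two figures is as average time per candidate. This is a back-of-the-envelope sketch that assumes the SHA and AES costs simply add up, which is only an approximation on a GPU where latency hiding and occupancy also change.

```python
SHA_ONLY_RATE = 290e6  # candidates/s with 2x SHA256 only (figure from the post)
SHA_AES_RATE = 70e6    # candidates/s with 2x SHA256 + AES (figure from the post)

# Average device time spent per candidate, in nanoseconds.
ns_sha = 1e9 / SHA_ONLY_RATE
ns_total = 1e9 / SHA_AES_RATE

# Under the additive assumption, the remainder is what the AES step costs.
ns_aes = ns_total - ns_sha
aes_share = ns_aes / ns_total

print(f"2x SHA256: {ns_sha:.2f} ns per candidate")
print(f"AES step:  {ns_aes:.2f} ns per candidate")
print(f"AES share of total time: {aes_share:.0%}")
```

Under that assumption the AES step takes roughly three quarters of the time per candidate, which is consistent with focusing the optimisation effort there.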
Here is what I think I know.
1) The 10 * 256 * 4 byte lookup tables (10 KB) AES uses are sort of redundant, as it's possible to shrink them to 4 KB using some bit shifts. It's possible to lower the memory footprint further, to around 2 KB, by rewriting tiny chunks of code. 2 KB should fit in any cache!
2) The current AES library has two options for storing the lookup tables: (a) using constant memory, and (b) copying the constant memory to a thread-specific local variable, duplicating the data across every worker (and spilling out of local memory).
3) My Intel 6200 performs much better using constant memory (2x), while my 1050 is 10x slower when using constant memory over local. It seems that even if the local memory data is duplicated across every thread and is forced to spill out of local memory (to global, I guess), it's still much faster than using constant memory on nvidia architectures. I assume this is because of how constant memory works there: when the threads in a warp access different memory locations, the reads are serialized, so the access latency grows with the number of distinct addresses, up to the number of threads in the warp.
4) Using the texture cache might be an option to increase performance on nvidia architectures. It's the only memory type that I know of that is shared among all multiprocessors and has a local cache. The local cache should be bigger than the size of the AES lookup tables, so it should be faster. I didn't test this.
5) Constant memory being slower than having a too-big-to-fit local allocation for each thread, duplicating the data, on nvidia, is something I can't wrap my head around.
6) I still believe the performance bottleneck is memory access for the AES lookup tables. When commenting out parts of the key expansion code (example), I get zero performance improvement until I comment out all but a few iterations, and then boom, a 4x improvement. That makes me think the cache is being thrashed, and only when the compiler optimises out enough of the local data can I finally see an improvement. The idea of being a few lines of code away from it going 3x faster is not healthy for me.
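To illustrate point 1, here is a sketch of why the four encryption tables are redundant. It assumes the conventional T-table layout (the one used by the Rijndael reference code and OpenSSL), where each of Te1..Te3 is a byte rotation of Te0. Whether the tables in the kernel discussed here use exactly this layout has not been checked. The same argument applies to the decryption tables.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo the AES polynomial 0x11B."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def rotl8(x, n):
    return ((x << n) | (x >> (8 - n))) & 0xFF


def rotr32(x, n):
    return ((x >> n) | (x << (32 - n))) & 0xFFFFFFFF


def build_sbox():
    """AES S-box: multiplicative inverse followed by the affine transform."""
    sbox = []
    for x in range(256):
        inv = next((y for y in range(1, 256) if gf_mul(x, y) == 1), 0)
        sbox.append(inv ^ rotl8(inv, 1) ^ rotl8(inv, 2)
                    ^ rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63)
    return sbox


SBOX = build_sbox()

# Each table is built independently from its own definition.
TE0 = [(gf_mul(s, 2) << 24) | (s << 16) | (s << 8) | gf_mul(s, 3) for s in SBOX]
TE1 = [(gf_mul(s, 3) << 24) | (gf_mul(s, 2) << 16) | (s << 8) | s for s in SBOX]
TE2 = [(s << 24) | (gf_mul(s, 3) << 16) | (gf_mul(s, 2) << 8) | s for s in SBOX]
TE3 = [(s << 24) | (s << 16) | (gf_mul(s, 3) << 8) | gf_mul(s, 2) for s in SBOX]

# Te1..Te3 can be replaced by a rotate of the matching Te0 entry.
redundant = all(
    TE1[i] == rotr32(TE0[i], 8)
    and TE2[i] == rotr32(TE0[i], 16)
    and TE3[i] == rotr32(TE0[i], 24)
    for i in range(256)
)
print("Te1..Te3 derivable from Te0 by rotation:", redundant)
print("bytes saved per direction:", 3 * 256 * 4)
```

The trade is one extra rotate per lookup in exchange for 3 KB less table data per direction, which matters if the tables are competing for cache.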
Also interesting to note: when the input to the sha256 function is the same over multiple runs and multiple workers, I get a 2x improvement (180 M/s). I think this is another indicator that the cache is being thrashed: the same input will always access the same table entries, so the GPU is somehow optimising for it.
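A toy model of the constant-memory behaviour suspected in point 3 and in the observation above. It assumes that one warp-wide constant-memory load costs one serialized read per distinct address, which is a simplification of how NVIDIA describes constant cache accesses. The numbers it prints are counts in this model, not measurements.

```python
import random

WARP_SIZE = 32       # threads per warp on NVIDIA hardware
TABLE_ENTRIES = 256  # entries in one AES lookup table


def constant_reads(indices):
    """Serialized reads for one warp-wide load in this simplified model:
    one read per distinct address requested by the threads of the warp."""
    return len(set(indices))


rng = random.Random(0)  # fixed seed so the run is repeatable

# Different candidates per thread: table indices look random.
random_lookups = [rng.randrange(TABLE_ENTRIES) for _ in range(WARP_SIZE)]

# Same input in every thread: all threads hit the same table entry.
uniform_lookups = [0x53] * WARP_SIZE

print("distinct candidates:", constant_reads(random_lookups), "reads per lookup")
print("identical input:    ", constant_reads(uniform_lookups), "read per lookup")
```

In this model, identical inputs collapse every lookup to a single broadcast read, which would match the direction of the speedup seen with a fixed sha256 input, though the model says nothing about its size.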
If there is anyone interested in a bounty to optimise the kernel please contact me! Or if anyone has any suggestion, that is. I believe I really tried everything.
01-20-2018, 06:48 PM (This post was last modified: 01-20-2018, 06:49 PM by uaioaqo.)
I've attached the program so far. To compile it, change the path in included.h to point to your local hashcat opencl directory; gcc -o main main.c -lOpenCL should then compile it just fine under Linux (and it should work under OS X as well). Changing the define next to AES256_set_decrypt_key in the kernel file should disable AES and pump the speed up from 128 M/s to more than a billion a second (with a 1080).
Unrelated, but I've found discussion in the forum before that dictionary attacks with more than 2 dictionaries are painfully slow, while I don't believe I've seen it slowing anything at all in here.
uaioaqo, this looks great! I've had issues with trying to recover an electrum v4 seed, but btcrecover was just too slow. If this could help me get my coins back I'd be super grateful, and I'd be willing to donate a portion of the recovered coins if I get them back :)
Unfortunately I'm a bit of a C noob. Could you explain a bit more in depth how to compile your fix into hashcat and how to run the program?