I've spent the better part of the last few days trying to optimise this AES thingy, and it's killing me, so I'm giving up.
My final figures are ~290 M/s for 2x SHA256 on a 1050 Ti (the hashcat benchmark for a single SHA256 shows ~900 M/s, so 2x should land around ~450; I attribute the gap to not using the "optimised" SHA256 version hashcat ships with) - and ~70 M/s for 2x SHA256 + AES.
Here is what I think I know.
1) The 10 * 256 * 4-byte lookup tables (10 KB) AES uses are largely redundant: the rotated variants of each round table can be derived from a base table with bit rotations, cutting the footprint to 4 KB, and rewriting small chunks of code gets it down to around 2 KB (see the first sketch after this list). 2 KB should fit in any cache!
2) The current AES library has two options for storing the lookup tables: (a) keeping them in constant memory, or (b) copying the constant memory into a thread-private local array, duplicating the data across every worker (and spilling out of local memory). The second sketch after this list illustrates both.
3) My Intel 6200 performs much better (2x) with constant memory, while my 1050 is 10x slower with constant memory than with local copies. Even though the local data is duplicated across every thread and forced to spill out of local memory (to global, I guess), it's still much faster than constant memory on nvidia architectures. I assume this comes down to how constant memory works there: the constant cache can only broadcast a single address to a warp, so when the locations accessed within a warp differ, the reads are serialised and latency grows with the number of distinct addresses, up to the warp size.
4) Using the texture cache might be an option to increase performance on nvidia architectures. It's the only path I know of where the data lives in memory visible to every multiprocessor while still being cached close to each one, and that per-SM cache should be bigger than the AES lookup tables, so it should be faster. I didn't test this; the third sketch after this list shows roughly what I mean.
5) Constant memory being slower on nvidia than a per-thread local allocation that's too big to fit and has to spill, duplicating the data for every thread, is something I can't wrap my head around. My best guess is that spilled local memory is backed by the ordinary L1/L2 hierarchy, which tolerates divergent addresses, while the constant cache serialises them (see point 3).
6) I still believe the performance bottleneck is memory access for the AES lookup tables. When commenting out parts of the key expansion code (example), I get zero performance improvement until I comment out all but a few iterations - then boom, a 4x improvement. That makes me think the cache is being thrashed, and only when the compiler optimises out enough of the local data do I finally see a gain. The idea of being a few lines of code away from it going 3x faster is not healthy for me.
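To make point 1 concrete, here's a minimal CUDA-flavoured sketch of the rotation trick (the table and function names are mine, not the library's): the four round tables of a classic T-table AES are byte rotations of one another, so a single 1 KB table plus rotates stands in for 4 KB of tables.

```cuda
#include <stdint.h>

// 32-bit right rotation; n is 8, 16 or 24 below, never 0 or 32.
__device__ __forceinline__ uint32_t ror32(uint32_t x, int n) {
    return (x >> n) | (x << (32 - n));
}

// One round-column computed from Te0 alone (1 KB) instead of the usual
// Te0..Te3 (4 KB): Te1/Te2/Te3 are just Te0 rotated right by 8/16/24 bits.
__device__ uint32_t round_column(const uint32_t *Te0,
                                 uint32_t s0, uint32_t s1,
                                 uint32_t s2, uint32_t s3,
                                 uint32_t rk) {
    return        Te0[(s0 >> 24)       ]
          ^ ror32(Te0[(s1 >> 16) & 0xff],  8)   // stands in for Te1
          ^ ror32(Te0[(s2 >>  8) & 0xff], 16)   // stands in for Te2
          ^ ror32(Te0[ s3        & 0xff], 24)   // stands in for Te3
          ^ rk;
}
```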
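For point 2, a hypothetical side-by-side sketch of the two storage strategies (my illustration, not the actual library code); the comments carry my guess from points 3 and 5 about why they behave so differently on nvidia.

```cuda
#include <stdint.h>

// Option (a): constant memory, 1 KB, filled from the host
// with cudaMemcpyToSymbol before the kernel launches.
__constant__ uint32_t Te0_const[256];

__global__ void aes_demo(const uint32_t *in, uint32_t *out) {
    // Option (b): a thread-private copy. 256 words per thread cannot live
    // in registers, so it spills to "local" memory, which is really
    // off-chip global memory cached by L1/L2.
    uint32_t Te0_local[256];
    for (int i = 0; i < 256; ++i)
        Te0_local[i] = Te0_const[i];

    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    uint32_t s = in[gid];

    // (a) the indices differ per thread, so the constant cache serialises
    //     a warp's 32 reads one address at a time;
    // (b) the same divergent reads go through L1/L2, which handle
    //     per-thread addresses - hence the local version winning on nvidia.
    uint32_t a = Te0_const[s & 0xff];
    uint32_t b = Te0_local[(s >> 8) & 0xff];
    out[gid] = a ^ b;
}
```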
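And for point 4, an untested sketch of what the texture route could look like. On compute capability 3.5+ you can reach the read-only/texture cache without binding a texture at all, via __ldg() on a const __restrict__ pointer; whether it actually beats the local-copy variant here is exactly what I didn't measure.

```cuda
#include <stdint.h>

// Te0 sits in ordinary global memory (visible to every SM); __ldg routes
// the loads through the per-SM read-only (texture) cache instead of L1.
__global__ void aes_demo_tex(const uint32_t * __restrict__ Te0,
                             const uint32_t *in, uint32_t *out) {
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    uint32_t s = in[gid];
    out[gid] = __ldg(&Te0[s & 0xff]);
}
```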
Also interesting to note: when the input to the sha256 function is the same over multiple runs and multiple workers, I get a 2x improvement (180 M/s). I think this is another indicator that the cache is being thrashed: identical inputs always hit the same table entries, so the lookups stay hot in cache and the GPU effectively optimises for it.