PCIe transfer compression
While trying to find the cause of the performance loss of WPA in version 4.x discussed in this thread https://hashcat.net/forum/thread-7251.html we found the root of the problem and were able to build a solution.

This solution base on a new technique to copy the data from the host system to the GPU memory, thus this post.

The problem was introduced while increasing the maximum password length from 64 (in theory, in reality 16-55) to 256 (always). The data type which holds a single password candidate increased from 68 byte to 260 byte. While we transfer the data it blocks the GPU from compute. In other words: The larger the buffer we want to transfer the longer it takes to copy data. The longer it takes to copy the data the less busy the GPU is. The less busy the GPU is the slower the cracking is.

The increase of 200 byte per password candidate doesn't sound very much but it actually is a lot. To give you some numbers and a feeling on how this affects the performance, here's an example: My GTX1080 (cracking WPA) did ~360kH/s with v3.40 and now does ~300kH/s. This is a 20% loss! It is imporant to note that this GTX1080 is connected via PCIe extender and therefore only gets x1 PCIe multiplier (instead of x16 multiplier if connected directly to the mainboard).

So what to do? We can't just roll back because we want to support long passwords but we also wont tolerate the drop in speed. What about a special compression function? The goal is to transfer less data via PCIe by compressing the "padded data" of each candidate. On average a password candidate only uses 8 byte of the 260 byte that are transfered, the rest is padded with zeros which in turn wastes 252 byte. It would therefor make sense to get rid of the extra zeros, like for example with some kind of "compression" algorithm.

I've designed and implemented such a compression function on both the host and the GPU. Instead of writing the candidates into a huge buffer that is pre-aligned for cracking tasks, it copies the candidates onto a stream (which is still aligned but only to 4 byte), while maintaining an index table at the same time in an extra buffer. This way we have to copy two buffers, but they are (when add together) much smaller than the single huge one (the padded one). But then, for the cracking kernel, to work optimally and as fast as possible, it still needs to access the password candidates in an aligned fashion (with the padded zeros). To achieve this I implemented a new "decompression" kernel which runs on the GPU side and is called after both buffers have been copied but before the actual cracking kernel starts. This kernel reads the index table and the password candidate stream and rebuilds that one huge padded password candidate buffer. Since this action is done completely ony the GPU (buffer reads and writes) no data is transfered via PCI-Express and achieves a very high decompression speed since it can be parallelized perfectly.

The results look as following:

  • v4.0.1 -> v4.1.0 (750Ti @ pci x16): 5.31 MH/s -> 6.82 MH/s +  28%
  • v4.0.1 -> v4.1.0 (1080  @ pci x1):  1.47 MH/s -> 8.45 MH/s + 574%

  • v4.0.1 -> v4.1.0 (750Ti @ pci x16):  805 kH/s ->  839 kH/s +   1%
  • v4.0.1 -> v4.1.0 (1080  @ pci x1):  1.21 MH/s -> 3.86 MH/s + 319%

  • v4.0.1 -> v4.1.0 (750Ti @ pci x16):   60 kH/s ->   61 kH/s +   1%
  • v4.0.1 -> v4.1.0 (1080  @ pci x1):   298 kH/s ->  357 kH/s +  17%

As you can see, this change has a very large effect on PCIe x1, but also some gains on PCIe x16. 

Of course it works with all hash-modes in combination with dictionary based attacks, not just the three hash-modes in the example.

- atom
Thanks for your effort and it's great to see the improved results!
Thank you for all the hard work you put into this. You guys do a great job fixing bugs and finding efficient solutions quickly.