Bitslice DES S-boxes with LOP3.LUT instructions
#1
I've been working on implementing a specific algorithm. Everything works, thanks to amongst others, some very useful tips and hints from the forum Smile

On a Titan X (3072 cores), the algorithm runs @~360M/s. This sounds good but in fact it isn't: my basic C++ OpenSSL based CPU code runs @~1M/s on an outdated CPU (1 core). I've isolated the time consuming code: it's the DES implementation I'm using, the one that's also used in e.g. m=3100. A quick test reveals that all algorithms that use this DES implementation are relatively slow.

I've also found that some other algorithms like m=3000 (LM) use a different DES implementation, "Bitslice DES S-boxes with LOP3.LUT instructions". Based on what LM does and the benchmarks speed, this code seems to be significantly faster. Using the code seems to be the way forward for me, to crack crypto data of the algorithm I'm using in a reasonable amount of time. For me as an amateur coder, the LOP3.LUT code is not easy to read / understand though... So I've got two questions:

- Any educated guesses about the speed of a raw DES of the code used in m=3000 versus m=3100?
- Can someone give me some pointer about how to get from the current code in m=3000 to something close to 'keydata = setkey(key)' and 'out = crypt(in, keydata, mode)'?

Thanks a bunch!
#2
Bitsliced DES makes sense with brute-force -a 3 mode. Otherwise the register pressure is too strong and your data gets swapped on global memory, making everything slower than it would be without bitslice. That's why I don't do bitslice with -a0 and -a1 kernel. Or in other words, if your kernel does anything else than DES, then you will end up in register pressure and it will be slower than as it is now. It also explains why I don't do it with -m 3100, even if it's just pure DES and use it only for -m 1500 and -m 3000 in -a 3 mode.

However, 360M/s on a titan x for a single DES encrypt or decrypt sounds too slow, even with the vanilla DES code. If it's just one round, it should be 1000MH/s or more. That tells me whatever you do before or after the DES, will create too much register pressure and with bitsliced code, your code will run slower than it is now.
#3
(07-14-2016, 12:20 PM)atom Wrote: Bitsliced DES makes sense with brute-force -a 3 mode. Otherwise the register pressure is too strong and your data gets swapped on global memory, making everything slower than it would be without bitslice. 

I've forgot to mention: I'm mainly (only) interested in brute forcing.

Quote:That's why I don't do bitslice with -a0 and -a1 kernel. Or in other words, if your kernel does anything else than DES, then you will end up in register pressure and it will be slower than as it is now.

The algorithm also used a number of hashing algorithms.

Quote:It also explains why I don't do it with -m 3100, even if it's just pure DES and use it only for -m 1500 and -m 3000 in -a 3 mode.

The algorithm also  uses a number of hash calculations.

Quote:However, 360M/s on a titan x for a single DES encrypt or decrypt sounds too slow, even with the vanilla DES code. If it's just one round, it should be 1000MH/s or more. That tells me whatever you do before or after the DES, will create too much register pressure and with bitsliced code, your code will run slower than it is now.

I use three DES encrypt / decrypts + several hash calculation so ~360M sound pretty ok right?

What would you recommend for speed improvement? The main loop is already optimized in a - IMHO - pretty decent way. My first impression is that a < 400 speed boost for 1 high end 3k+ core GPU versus 1 outdated CPU is pretty poor Sad

Thanks, John
#4
Quote:My first impression is that a < 400 speed boost for 1 high end 3k+ core GPU versus 1 outdated CPU is pretty poor

There's no such general statement. I think your speed, if it includes 3xDES, is almost optimal because 3 x 360 > 1000. DES is a special case because of the sbox lookups. Not the memory access is slow, GPU memory is extreme fast, but for each request the pointer calculation creates three additional instructions to pinpoint the correct section in memory. Other hashes like MD5 which do not require such lookups are much faster.
#5
(07-15-2016, 10:28 AM)atom Wrote:
Quote:My first impression is that a < 400 speed boost for 1 high end 3k+ core GPU versus 1 outdated CPU is pretty poor

There's no such general statement. I think your speed, if it includes 3xDES, is almost optimal because 3 x 360 > 1000. DES is a special case because of the sbox lookups. Not the memory access is slow, GPU memory is extreme fast, but for each request the pointer calculation creates three additional instructions to pinpoint the correct section in memory. Other hashes like MD5 which do not require such lookups are much faster.

Thanks for the detailed reply! Given a 3 x DES + 2 x SHA-1 algorithm and attack mode brute force, would you think it's worth it to look into using 'Bitslice DES S-boxes with LOP3.LUT instructions'?

Thanks,

John
#6
No, it will be slower because of the register pressure