Bitslice status and broken nvcc
For those who aren't following hashcat on Twitter: I'm experimenting with bitslice DES/LM (-m 3000) for oclHashcat v1.38. So far everything is working fine, but while working on DEScrypt (-m 1500) something really strange happened.

I have a test kernel that I use for my experiments before fully integrating anything into oclHashcat; it helps me identify performance bottlenecks at an early stage. I finalized it and started porting. Ported to AMD, everything is fine. We're at 470 MH/s on a single stock-clocked 290x (yay, world's fastest!). This is an extreme improvement which I'm a bit proud of, because we "only" got 170 MH/s on this algorithm so far. Still, I don't know yet how to distribute this kernel, because I'm forced to hardcode the salt (and therefore generate 4096 kernels for each architecture). Maybe distribution as source is the only way to do it. Anyway, I also need to thank Sc00bz for explaining how those E-boxes are to be used.
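For readers wondering why the salt forces 4096 kernel variants: in descrypt, each of the 12 salt bits, when set, swaps a pair of bits (i and i+24) in the 48-bit output of the DES expansion (E-box). A minimal plain-C sketch of that behavior (function name `apply_salt` is mine, just for illustration):

```c
#include <stdint.h>

/* Sketch: apply a 12-bit descrypt salt to the 48-bit E-box output.
   If salt bit i is set, bits i and i+24 of the expansion output are
   swapped. With a hardcoded salt these swaps become fixed wiring the
   compiler optimizes away; done at runtime, every round pays for
   selects -- hence one kernel per salt, 4096 in total. */
static uint64_t apply_salt(uint64_t e_out /* 48-bit E output */,
                           uint32_t salt  /* 12-bit salt      */)
{
    for (int i = 0; i < 12; i++) {
        if (salt & (1u << i)) {
            uint64_t b1 = (e_out >> i)        & 1;
            uint64_t b2 = (e_out >> (i + 24)) & 1;
            if (b1 != b2) {
                /* swap differing bits by flipping both */
                e_out ^= ((uint64_t)1 << i) | ((uint64_t)1 << (i + 24));
            }
        }
    }
    return e_out;
}
```

Note that applying the same salt twice restores the original value, since each swap is its own inverse.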

Then I started porting to CUDA, and there is the problem: speeds are slower than expected. I guess you all heard the news about 950 MH/s with descrypt on a 980Ti etc. I was sceptical of my first implementation, so I rewrote the entire thing in CUDA just to make sure I didn't have a bug somewhere. But no, it turned out there is none. The culprit is nvcc! So what I have here is a kernel whose body is 1:1 the same code in both OpenCL and CUDA (yes, NVidia has an OpenCL runtime too). When I tested it on NVidia's OpenCL, the speed was much better than on their own CUDA?!?! WTH is going on...

To give you some numbers, we're at 73 MH/s on CUDA and 110 MH/s on OpenCL, measured on a 750Ti. OpenCL speed on a 980Ti is around 350 MH/s. What I'm trying to say here is that there's something wrong with the nvcc compiler. To prove it I had to do some tricks, since it's not possible to compile OpenCL code with nvcc, but it is possible to dump a compiled OpenCL kernel from NVidia's OpenCL runtime! So I compiled the OpenCL kernel, dumped it, and because it's 1:1 the same code as for CUDA (including the parameters), I was able to load the pure .ptx kernel from cudaHashcat. The resulting speed is about 350 MH/s on CUDA, and hashes are cracking.

The problem is NVidia's OpenCL runtime: there's no way to tell the compiler to generate code for a specific GPU architecture. But due to our binary kernel distribution we really need that feature!

One last thing, I know you're gonna ask: yes, I'm using lop3 for the sboxes. Reported speeds from other projects doing 950 MH/s on descrypt or pure DES with lop3 are not reproducible, not even with the pure sboxes inside a minimalistic kernel on a standalone platform. Feel free to try it yourself. What you really get is 470 MH/s on a 290x and 350 MH/s on a 980Ti, and just this is some real improvement.
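For anyone unfamiliar with lop3: it's a Maxwell instruction that computes an arbitrary 3-input boolean function in one op, selected by an 8-bit truth-table immediate, which is why it shrinks bitslice sbox expressions so much. A plain-C emulation of its semantics (the CUDA inline-PTX form is shown in the comment; the helper name `lop3` is mine):

```c
#include <stdint.h>

/* Plain-C emulation of the Maxwell LOP3 instruction: any 3-input
   boolean function in one op, chosen by an 8-bit truth table. On
   CUDA you reach it via inline PTX, e.g. for a ^ b ^ c:
     asm("lop3.b32 %0, %1, %2, %3, 0x96;"
         : "=r"(r) : "r"(a), "r"(b), "r"(c));
   The immediate is simply the desired function applied to the
   canonical inputs ta=0xF0, tb=0xCC, tc=0xAA (so XOR3 gives
   0xF0 ^ 0xCC ^ 0xAA = 0x96). */
static uint32_t lop3(uint32_t a, uint32_t b, uint32_t c, uint8_t lut)
{
    uint32_t r = 0;
    for (int i = 0; i < 32; i++) {
        /* per bit lane: index the truth table with (a,b,c) */
        unsigned idx = (((a >> i) & 1) << 2)
                     | (((b >> i) & 1) << 1)
                     |  ((c >> i) & 1);
        r |= (uint32_t)((lut >> idx) & 1) << i;
    }
    return r;
}
```

A handy sanity check: feeding the canonical inputs 0xF0, 0xCC, 0xAA back into `lop3` returns the truth-table immediate itself.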
Can the source be reduced to a minimal test case, suitable for sharing with the nvcc team?
In theory yes, but my experience with NV and AMD in the past is that they don't care about bugs. When CUDA was young I reported a ton of bugs, and the same for AMD over the years. It takes a lot of effort to pack everything together so that they can reproduce it, and then to see nothing happen is very frustrating, so I stopped reporting bugs a few years ago.
This is unexpected.
Downgrading to a previous CUDA Toolkit (with Maxwell support) could do the trick, but as far as I remember you're against that, right?
I tried, but 7.0 does not support inline asm for lop3.
Okay, so this predates 7.0.
Posting this on NV forums is worth a shot, though.
The claimed 967 MH/s (on a 980Ti +250MHz) was using CUDA. Did you try that code with your toolchain? BTW the version that doesn't need a kernel per salt is not that much slower: 826 MH/s under the same conditions.

With that speed you'd get 24 GH/s for LM (descrypt iterates DES 25 times per hash, so 967 MH/s x 25 is roughly 24 GH/s), that would be some record...
OK, News:

1) I got it so far that the speeds are approximately the same as the ones from DeepLearningJohnDoe: 766 MH/s vs. 967 MH/s (on a 980Ti +250MHz) using CUDA. Note that we do a markov-optimized search, so some speed is lost on the database lookups for each candidate.

2) I solved the fixed-salt problem. But it's interesting how different the speed loss is on the 290x and the 980Ti. For example the 980Ti drops from 766 MH/s to 736 MH/s, which is "ok", but the 290x... OMG... The reason why I didn't get it working in the first place was because the AMD OpenCL SDK strikes back again, but not in a good way.

Fixed salt speed: 470 MH/s

Dynamic salts require a macro:

#define mysel(a,b,c) ((c) ? a : b) -- Drops 470 -> 33 MH/s !!!

Luckily I played around and found the following workarounds:

#define mysel(a,b,c) (select (a,b,c)) -- Getting 110 MH/s
#define mysel(a,b,c) (bitselect (a,b,(c) ? 0xffffffff : 0)) -- Getting 251 MH/s

So there's a drop from 470 MH/s to 251 MH/s for dynamic salt support. I'll take it for now, as we also get multihash support with it.
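The gap between the three macros makes sense once you see what they compile to: the plain ternary can become a branch (or per-element predication), while bitselect is pure bitwise logic. A plain-C sketch of the OpenCL bitselect semantics and the fastest macro variant (names `bitselect32`/`mysel` are mine, mirroring the macros above):

```c
#include <stdint.h>

/* Emulation of OpenCL bitselect(a, b, c): each result bit is the
   corresponding bit of b where c has a 1, and of a where c has a 0.
   Pure bitwise ops, no branch -- which is presumably why it beats
   the ternary so badly in the kernel. */
static uint32_t bitselect32(uint32_t a, uint32_t b, uint32_t c)
{
    return (a & ~c) | (b & c);
}

/* The 251 MH/s variant, expressed as a function: widen the scalar
   condition into an all-ones/all-zeros mask, then bitselect. */
static uint32_t mysel(uint32_t a, uint32_t b, int c)
{
    return bitselect32(a, b, c ? 0xffffffffu : 0u);
}
```

With a constant mask the whole thing folds down to two ANDs and an OR per select, and the condition is evaluated once instead of per use.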
Good stuff.

#define mysel(a,b,c) ((c) ? a : b) -- Drops 470 -> 33 MH/s !!!

Lulz! That's like... errr it's not like anything.
So, who's first to get 20 GH/s for LM on a single card? May the best man win. You are leading so far, but you're only half way there.