Get rid of CUDA
#1
For those who are following the Git tree: we just merged the GetRidOfCUDA branch into master, which had a big impact on the code.

Basically, as the name says, we replaced CUDA with OpenCL for NVidia. Discussions that led to the change can be found here:

https://github.com/hashcat/oclHashcat/issues/3
https://github.com/hashcat/oclHashcat/issues/4
https://github.com/hashcat/oclHashcat/issues/20

The reasons are:
  • There's no JIT compiler on CUDA. We can't fully accelerate DEScrypt cracking without one, since that requires the salt to be known at compile time
  • Preparation for support of other OpenCL-compatible platforms. ATM we support GPUs only, but this should enable the use of CPUs and/or FPGAs to run oclHashcat when finished
  • Preparation for restructuring the files to help integration into Linux distributions. Having two binaries (oclHashcat and cudaHashcat) is confusing and creates library conflicts
  • Get rid of two packages for oclHashcat. Namely get rid of cudaHashcat. Both AMD and NV users will use oclHashcat64.bin or oclHashcat64.exe
  • Distribute the kernels as source. That should greatly reduce the cases where an imperfect precompiled kernel gets selected, especially for low-end GPUs
  • No more need to have two special code bases for AMD and NV. This will reduce maintenance cost
  • No more dependency on CUDA SDK, should help in building. We can use the OpenCL headers from the AMD SDK, they are fully compatible, even cross-platform
  • No more precompilation by the developers of all kernels for all GPU types (this took around an hour for each beta)
  • Reduced package size. For example, the NVidia package dropped from 89MB to 3MB
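
To illustrate the JIT point from the list above: with kernels shipped as source, the host can bake the salt into the kernel as a preprocessor constant when it builds the program at runtime. Below is a minimal sketch in plain C that only formats the option string one could pass as the `options` argument of OpenCL's clBuildProgram(). The helper name build_salt_options and the macro name DESCRYPT_SALT are assumptions made up for this example, not names taken from the oclHashcat source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: append "-D DESCRYPT_SALT=<salt>" to some base build
 * options so the salt becomes a compile-time constant inside the kernel.
 * Traditional DES crypt uses a 12-bit salt, hence the range check.
 * Returns 0 on success, -1 if the buffer is too small. */
int build_salt_options(char *buf, size_t len, const char *base_opts, unsigned int salt)
{
    assert(buf != NULL && base_opts != NULL);
    assert(salt < 4096u);

    /* Only insert a separating space when there are base options. */
    const char *sep = (strlen(base_opts) > 0) ? " " : "";

    int n = snprintf(buf, len, "%s%s-D DESCRYPT_SALT=%u", base_opts, sep, salt);

    return (n >= 0 && (size_t) n < len) ? 0 : -1;
}
```

The resulting string would be handed to clBuildProgram() once per salt, so the kernel compiler can treat the salt as a constant instead of reading it at runtime.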
While refactoring I also dropped the support for SIMD code for almost the same reasons:
  • Make the kernel code more compact and therefore more portable. This may become even more important when adding other platforms
  • Almost no GPUs are left that require SIMD code to reach full performance, namely AMD 4xxx, 5xxx and 6xxx. While 4xxx was already dropped from Catalyst, AMD said they are about to drop support for 5xxx and 6xxx as well
  • In case we really need it back: OpenCL gives us "true" vector datatype support, even on NVidia, so we can use vector datatypes in the inner-loop kernels
  • Preparation to enable the port of some of the rules that were only usable on CPU, for example the @ Purge rule or the M Memorize rule, which then enables the append/prepend memory rules
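
For readers who don't know the rules mentioned in the last point, here is a rough CPU-side sketch in plain C of what they do: @X purges every occurrence of a character, M memorizes the current word, and the append memory rule appends the memorized word. The function names and the WORD_MAX limit are assumptions for illustration only. This is not the actual rule engine and not GPU kernel code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define WORD_MAX 64 /* hypothetical buffer size for this sketch */

/* '@X' Purge: remove every occurrence of x from word, in place.
 * Returns the new length. */
size_t rule_purge(char *word, char x)
{
    assert(word != NULL);

    size_t out = 0;

    for (size_t in = 0; word[in] != '\0'; in++)
    {
        if (word[in] != x) word[out++] = word[in];
    }

    word[out] = '\0';

    return out;
}

/* 'M' Memorize: save a copy of the current word (truncated to fit). */
void rule_memorize(char mem[WORD_MAX], const char *word)
{
    snprintf(mem, WORD_MAX, "%s", word);
}

/* Append memory: append the memorized word to the current word.
 * Returns 0 on success, -1 if the result would not fit. */
int rule_append_memory(char word[WORD_MAX], const char *mem)
{
    size_t wl = strlen(word);
    size_t ml = strlen(mem);

    if (wl + ml >= WORD_MAX) return -1;

    memcpy(word + wl, mem, ml + 1); /* +1 copies the terminating NUL */

    return 0;
}
```

Memorize needs per-candidate state that survives across later rule functions, which is why it is the prerequisite for the append/prepend memory rules.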
This refactoring really created some work:
  • Half of the kernels dropped in speed before optimizing them for OpenCL + NV. For each kernel it was necessary to analyze the root causes of the performance drops and find solutions
  • NVidia's OpenCL runtime does not support C++ code (as AMD's does via the -x c++ flag), but a lot of the shared GPU code relied on function overloading etc.
  • The HMS code was based almost completely on macro-dependent branches, which had to be rewritten as true runtime branches. This also had a big impact on the Makefile and the SDK dependencies
  • Dropping the SIMD code
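
To make the HMS point from the list above more concrete: with two separate binaries the vendor could be selected by the preprocessor, but a single binary has to decide at runtime. The following is a minimal sketch in plain C. The enum, the function names and the stub return values are all made up for this example; the stubs stand in for the real vendor monitoring library calls.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical vendor tag, detected once per device at startup. */
typedef enum { VENDOR_UNKNOWN, VENDOR_AMD, VENDOR_NV } vendor_t;

/* Stubs standing in for the vendor-specific monitoring calls. */
static int amd_temperature_stub(int device_id) { return 60 + device_id; }
static int nv_temperature_stub(int device_id)  { return 70 + device_id; }

/* Old style (one binary per vendor), decided at compile time:
 *
 *   #ifdef BUILD_FOR_NV
 *     temp = nv_temperature_stub(device_id);
 *   #else
 *     temp = amd_temperature_stub(device_id);
 *   #endif
 *
 * New style (one binary for both vendors), decided at runtime: */
int hm_get_temperature(vendor_t vendor, int device_id)
{
    assert(device_id >= 0);

    switch (vendor)
    {
        case VENDOR_AMD: return amd_temperature_stub(device_id);
        case VENDOR_NV:  return nv_temperature_stub(device_id);
        default:         return -1; /* no monitoring support for this device */
    }
}
```

With runtime branches, both code paths are compiled into the same binary, which is presumably why the Makefile and the SDK dependencies had to change as well.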
Of course, such a big change also has a big impact on performance, but we were able to work around almost all of the performance drops. In return, we get some huge speed boosts for some other algorithms:

https://docs.google.com/spreadsheets/d/1...li=1#gid=0

Note that these numbers (especially the red boxes) are not final. I'll continue to find solutions for them in the master branch.

Thanks to philsmd for porting the HMS (Fanspeed, Utilization, Temperature) code portion.
Thanks to dropdead, epixoip, philsmd, Rolf and Xanadrel for help with performance tuning.

--
atom
#2
Hi atom...

Do you have an eta for a new release version incorporating the changes?

Thanks for all of your hard work!

Betawave
#3
If you want to experiment you can always download the latest build at https://hashcat.net/beta/