The Autotune Engine
#1

If you're using the latest beta version or a build from GitHub, you'll see the following new notification on program start:

Quote:
Device #1: autotuned kernel-accel to 1024
Device #1: autotuned kernel-loops to 200

This is a new system which tries to automatically find out the best workload tuning for your attack.

TL;DR

If you're not interested in all the details, all you need to know is that:
  • The way you tune oclHashcat has changed
  • The autotune engine is always active, but you can override it
  • The definition of the -w (workload-profile) parameter has changed. Now -w is your friend
  • If you're not a developer, forget the parameters -u and -n
The goal was that the user can use the single -w parameter to easily set how they want the autotune engine to make use of the available resources. The profiles are defined as follows:
  • -w 1: Watch movies or play games (even ones that require a high FPS count), lowest performance
  • -w 2: Normal desktop operations or "economic" mode (like internet browsing, text editing, etc.), default setting
  • -w 3: Headless systems or dedicated cracking systems, highest performance
Don't forget, don't use -n or -u anymore!

If you don't care about desktop lag, simply add -w 3 to your command line.

Why the change?

In the past I hardcoded some preconfigured values for kernel-accel (-n) and kernel-loops (-u), depending on hash-type and GPU vendor. They were set to values optimized to run best on the current high-end GPU in Brute-Force attack-mode. This was a suboptimal solution, because these values need to be changed if you use a different attack-mode or a low-end GPU.

Therefore I made the parameters -n and -u available on the command line so you could adjust them for your specific case. Later I found out that there's a fixed relation between the optimal values for the different attack-modes, and that this relation does not depend on the hash-type. That made it possible to make the process easier for the user, so I added the --workload-profile (or -w) parameter. The definition of this parameter has changed with the new version, which is why I don't want to go too much into the details of how it was used before; that would just confuse you.

Now, with the latest oclHashcat version, which supports all OpenCL-compatible device types such as CPUs and other accelerators, such fixed values no longer fit. These devices are so different in design that each requires its own strategy to find the best values.

I was forced to rethink how to find the optimal settings. The first change was to get rid of the hardcoded values in oclHashcat entirely and move them into a user-configurable text database. In that database you can set the ideal tuning values for each device, attack-mode and hash-type. But it quickly turned out that this database becomes huge, and such databases are typically too hard to maintain and end up stillborn. There was simply no way around an automatic solution, and that's how the idea of an autotuning engine came about.

How to use it?

You don't need to activate it or do anything else to make use of it. The autotune engine is always active whenever you start oclHashcat. Since any automatic mechanism can make mistakes, because of some unknown variable or a lack of information, a mechanism is needed that can override whatever it calculates. There are two ways to override the autotune engine:
  • Set --opencl-vector-width, -n and -u by hand
  • Have the combination of device-name, attack-mode and hash-type match an entry in the tuning database
Tuning database?

The tuning database is a simple text file in a CSV-like format: the fields on each line are separated with tabs or spaces. Here are some rules:
  • This file is used to override autotune settings
  • This file is used to preset the Vector-Width, the Kernel-Accel and the Kernel-Loops Value per Device, Attack-Mode and Hash-Type
  • A valid line consists of the following fields (in that order):
    • Device-Name
    • Attack-Mode
    • Hash-Type
    • Vector-Width
    • Kernel-Accel
    • Kernel-Loops
  • The first three columns define the filter, the other three are the values assigned when that filter matches
  • If no filter matches, autotune is used
  • Columns are separated with one or more spaces or tabs
  • A line cannot start with a space or a tab
  • Comment lines are allowed, use a # as first character
  • Invalid lines are ignored
  • The Device-Name is the OpenCL Device-Name. It's shown on oclHashcat startup.
  • If the Device-Name contains spaces, replace each space with the _ character.
  • The Device-Name can be assigned an alias. This is useful if many devices share the same chip
  • The use of wildcards is allowed, some rules:
    • Wildcards can only replace an entire Device-Name, not just parts of it, e.g. Geforce_* is not allowed
    • The policy is local > global, meaning the more specifically you configure something, the more likely it is selected
    • The policy testing order is from left to right
  • Attack modes can be:
    • 0: Dictionary-Attack
    • 1: Combinator-Attack, will also be used for attack-mode 6 and 7 since they share the same kernel
    • 3: Mask-Attack
  • The Kernel-Accel is a multiplier to OpenCL's concept of a workitem, not the workitem count
  • The Kernel-Loops has a functionality depending on the hash-type:
    • Slow Hash: Number of iterations calculated per workitem
    • Fast Hash: Number of mutations calculated per workitem
  • Neither of the two should be confused with the OpenCL concept of a "thread", which is maintained automatically
  • The Vector-Width can have only the values 1, 2, 4, 8 or 'N', where 'N' stands for native, which is a value queried from OpenCL
  • The Kernel-Accel is limited to 1024
  • The Kernel-Loops is limited to 1024
  • The Kernel-Accel can have 'A', where 'A' stands for autotune
  • The Kernel-Loops can have 'A', where 'A' stands for autotune
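
To make the rules above concrete, here is a minimal Python sketch of how a line of such a file could be validated and how a matching entry could be looked up. It is not oclHashcat's actual parser. The names TuningEntry, parse_tuning_line and find_entry are made up for illustration, alias lines are skipped because their syntax isn't described here, the lower bound of 1 for numeric values is an assumption, and the lookup order is only one possible reading of "local > global, tested from left to right".

```python
from typing import Iterable, NamedTuple, Optional


class TuningEntry(NamedTuple):
    device_name: str
    attack_mode: str
    hash_type: str
    vector_width: str
    kernel_accel: str
    kernel_loops: str


def _is_number(value: str) -> bool:
    return value.isascii() and value.isdigit()


def _valid_tuning_value(value: str) -> bool:
    # 'A' means autotune; numeric values are limited to 1024.
    # The lower bound of 1 is an assumption of this sketch.
    return value == "A" or (_is_number(value) and 1 <= int(value) <= 1024)


def parse_tuning_line(line: str) -> Optional[TuningEntry]:
    """Return the parsed entry, or None for comment and invalid lines."""
    line = line.rstrip("\r\n")
    # Comments start with '#'; a line must not start with a space or a tab.
    if not line or line[0] in "# \t":
        return None
    fields = line.split()  # columns are separated by one or more spaces/tabs
    if len(fields) != 6:
        return None  # alias lines are out of scope for this sketch
    entry = TuningEntry(*fields)
    if entry.attack_mode not in ("*", "0", "1", "3"):
        return None
    if entry.hash_type != "*" and not _is_number(entry.hash_type):
        return None
    if entry.vector_width not in ("1", "2", "4", "8", "N"):
        return None
    if not (_valid_tuning_value(entry.kernel_accel)
            and _valid_tuning_value(entry.kernel_loops)):
        return None
    return entry


def find_entry(entries: Iterable[TuningEntry], device_name: str,
               attack_mode: int, hash_type: int) -> Optional[TuningEntry]:
    """Assumed policy: exact values beat wildcards, leftmost column first."""
    device = device_name.replace(" ", "_")
    # Attack-modes 6 and 7 share the kernel of attack-mode 1.
    mode = "1" if attack_mode in (6, 7) else str(attack_mode)
    # If a filter appears twice, the last one wins here (not specified above).
    index = {(e.device_name, e.attack_mode, e.hash_type): e for e in entries}
    for d in (device, "*"):
        for a in (mode, "*"):
            for h in (str(hash_type), "*"):
                if (d, a, h) in index:
                    return index[(d, a, h)]
    return None  # no filter matches, so autotune would be used
```

For example, with the two lines "* * * N 1024 1024" and "GeForce_GTX_980 3 * N A A" (the device name is only a placeholder), a mask attack on a device reported as "GeForce GTX 980" would pick the second entry, while every other combination falls back to the first.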
Personal tuning settings

The tuning database can also be used to store your personal tuning settings. For example, if you want to go full power you can simply add an entry like this:

* * * N 1024 1024

But you have to live with all the consequences: high power consumption, extreme heat, restore checkpoints that are far apart, slow speed updates, a laggy desktop, etc. Generally this is not what you want.

It makes much more sense to fine-tune the settings. To give you an idea of how to do it, here's how I do it:

Preparation
  • set your fanspeed to 100% (if applicable)
  • set your power limit to 100% (if applicable)
  • set your core clock to stock settings
  • set your memory clock to stock settings
  • for every run, give it time to settle down, that is when it seems to have reached a speed that doesn't increase anymore
  • use a single hash for testing; if you need an example hash for the different algorithms, see the example hashes page on the wiki
  • Attack-Mode 0:
    • Choose your favourite wordlist (should have > 10m words)
    • Choose your favourite ruleset (should have > 2000 rules)
    • Example: oclHashcat64.exe -a 0 hash.txt wordlist.txt -r rules\rockyou-30000.rule -m xxx -u xxx -n xxx --opencl-vector-width xxx
  • Attack-Mode 1:
    • Choose your favourite wordlist (should have > 10m words)
    • Choose ?a?a?a as mask
    • Example: oclHashcat64.exe -a 6 hash.txt wordlist.txt ?a?a?a -m xxx -u xxx -n xxx --opencl-vector-width xxx
  • Attack-Mode 3:
    • Choose ?b?b?b?b?b?b?b as mask
    • Example: oclHashcat64.exe -a 3 hash.txt ?b?b?b?b?b?b?b -m xxx -u xxx -n xxx --opencl-vector-width xxx
Measurement

  1. Find kernel accel
    • Set vector width to 1
    • Set kernel accel to 1024
    • Set kernel loops to 1
    • Decrease kernel accel by dividing it by two until GPU utilization settles below 95%
  2. Find vector width
    • Set vector width to 1
    • Set kernel accel to the previous value
    • Set kernel loops to 1
    • Try the four vector widths 1, 2, 4 and 8 and use the one with the lowest exec runtime
  3. Find kernel loops:
    • Set vector width to the previous value
    • Set kernel accel to the previous value
    • Set kernel loops to 1
    • Increase kernel loops in steps of 8 until the execution time is closest to 64 ms (in the status screen)
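
The three measurement steps can be sketched as a small search routine. This is only a Python illustration of the manual procedure, not something oclHashcat provides. The callbacks utilization and runtime_ms are hypothetical stand-ins for "run the attack with these settings and read the value off the status screen". The sketch also assumes that "steps of 8" means 1, 8, 16, ... up to 1024 and that the runtime grows with kernel loops, so the search can stop once 64 ms is exceeded.

```python
from typing import Callable

TARGET_MS = 64.0  # target execution time from step 3


def find_kernel_accel(utilization: Callable[[int], float]) -> int:
    """Step 1: start at 1024 and halve until utilization drops below 95%."""
    accel = 1024
    while accel > 1 and utilization(accel) >= 95.0:
        accel //= 2
    return accel


def find_vector_width(runtime_ms: Callable[[int], float]) -> int:
    """Step 2: try widths 1, 2, 4 and 8 and keep the fastest one."""
    return min((1, 2, 4, 8), key=runtime_ms)


def find_kernel_loops(runtime_ms: Callable[[int], float]) -> int:
    """Step 3: raise loops in steps of 8 and keep the value closest to 64 ms."""
    best = 1
    best_diff = abs(runtime_ms(1) - TARGET_MS)
    for loops in range(8, 1025, 8):
        elapsed = runtime_ms(loops)
        diff = abs(elapsed - TARGET_MS)
        if diff < best_diff:
            best, best_diff = loops, diff
        if elapsed > TARGET_MS:
            break  # assumes runtime only grows from here on
    return best
```

In practice each callback would fix the two settings found so far, as described in the steps, and vary only the third. For a quick dry run you can pass simple lambdas instead of real measurements.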
#2
Also linked to this post from the wiki here:

https://hashcat.net/wiki/doku.php?id=autotune
#3
It's awesome! But I find that when I use -w 3, the speed of rar3 cracking improves about 10 times! Could you please explain how the autotune achieves this?