New architecture, different sched
#1
OK, here's a nice case showing how the oclHashcat-plus refactoring changes many things.

In a good way!

So, I'm still working on the rule engine to improve its performance, and I ran the following test:

I generated a big set of rules using maskprocessor with this command:

Code:
$ mp64.bin -o rule "^?s ^?l ^?d"

This ruleset is not really useful in practice, but it is handy for benchmarks during development. It produces around 20k rules, which is enough to work with.

Now I run this set with v0.14 using rockyou.txt against 500k MD5 hashes. On my 2x HD 6990 I'm getting:

Quote:Speed.GPU.#*...: 4838.9 MH/s

Compared to the new version:

Quote:Speed.GPU.#*...: 4218.1 MH/s

So yeah, I have made some additional improvements, since this is now "only" a 12.8% loss.
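
For anyone who wants to check the 12.8% figure, here is a minimal sketch of the arithmetic. The two speeds are copied from the status lines above; the function name is just for illustration.

```python
# Speeds copied from the two Speed.GPU status lines above, in MH/s.
old_speed = 4838.9  # v0.14
new_speed = 4218.1  # new version

def relative_loss(old: float, new: float) -> float:
    """Fraction of the old speed that the new version gives up."""
    return (old - new) / old

loss = relative_loss(old_speed, new_speed)
print(f"{loss:.1%}")  # prints 12.8%
```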

This is for 2 reasons:

1. Because the ^ rule I'm using in the example ruleset is a very cheap rule.

2. When I started the old version, I was getting this:

Quote:NOTE: autotuned --gpu-accel from 320 to 64

The new version does not face this problem, because it does not have to divide the number of words from the input dictionary by an average password length of 8.

That means it can run at a higher -n value, which additionally compensates for some of the loss.

However, I also measured the time both commands take to execute the run.

And now comes the surprising part: the new version finishes faster. But why is that?

Is the old version displaying a wrong speed? No, it's not!

Both versions, when you hit the "s" key, show the current speed.

The problem shows up when the dictionary comes to its end. You might have noticed that when a run is about to finish, the last few percent of a list take longer.

That is because the old version has to flush unfinished buffers of words from the dictionary. Those unfinished buffers can, in theory, hold only 1 word. So oclHashcat-plus reduces the workload for the GPU calculation, BUT it still has all the overhead, like copying the buffer and running at least X threads (64 for AMD). So it is not possible to reduce the workload so far that a flush only takes the time of a full buffer scaled down to the number of remaining words. Therefore, v0.14 oclHashcat-plus takes VERY long to finish.
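
To make the flush overhead a bit more concrete, here is a toy cost model. It is only a sketch and not how oclHashcat-plus measures anything: FIXED_OVERHEAD and COST_PER_THREAD are made-up numbers in arbitrary units, and the only value taken from this post is the minimum of 64 threads on AMD.

```python
MIN_THREADS = 64         # minimum launch width for AMD, as mentioned above
FIXED_OVERHEAD = 1.0     # hypothetical: buffer copy + kernel launch, arbitrary units
COST_PER_THREAD = 0.001  # hypothetical: work of one thread, same arbitrary units

def launch_cost(words: int) -> float:
    """Cost of one GPU launch for a buffer holding `words` words."""
    # The launch can never be narrower than MIN_THREADS, even for 1 word.
    threads = max(words, MIN_THREADS)
    return FIXED_OVERHEAD + threads * COST_PER_THREAD

def cost_per_word(words: int) -> float:
    """Launch cost spread over the words actually in the buffer."""
    return launch_cost(words) / words

print("full buffer, per word: ", cost_per_word(100_000))
print("1-word flush, per word:", cost_per_word(1))
```

In this model a flush with 1 word costs exactly as much as a flush with 64 words, so the fewer words are left in a buffer, the worse the cost per word gets.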

As a result, the old version took this long:

Quote:real 1m55.290s
user 0m12.849s
sys 0m5.484s

Now, the new version does not have all these problems. It does not have to hold multiple caches, one for each word length. It has a cache, but its size is number-of-SP * number-of-threads * gpu-accel * native-vector-size-of-gpu. That sounds big, but the old version had the same buffer 15 times over (which is why it only supported length 15, and why increasing that would take a massive amount of RAM with the old architecture).

To make a long story short, the new version finished faster:
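
Here is a rough sketch of that size formula. All input values and BYTES_PER_SLOT are hypothetical placeholders, not the real numbers for any GPU or for oclHashcat-plus; only the formula itself and the factor of 15 come from the paragraph above.

```python
def cache_slots(num_sp: int, threads: int, gpu_accel: int, vector_size: int) -> int:
    """Cache size in slots: number-of-SP * number-of-threads * gpu-accel * vector size."""
    return num_sp * threads * gpu_accel * vector_size

# Hypothetical example values, chosen only to show the arithmetic.
slots = cache_slots(num_sp=24, threads=64, gpu_accel=64, vector_size=4)

BYTES_PER_SLOT = 16      # hypothetical size of one slot
OLD_BUFFER_COUNT = 15    # old architecture: the same buffer, 15 times

new_bytes = slots * BYTES_PER_SLOT
old_bytes = new_bytes * OLD_BUFFER_COUNT

print("slots:", slots)
print("new architecture bytes:", new_bytes)
print("old architecture bytes:", old_bytes)
```

Whatever the real values are, the old layout needs 15 times the memory of the new one, and every extra supported length would add one more full buffer on top.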

Quote:real 1m24.884s
user 0m9.693s
sys 0m3.312s
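
The two "real" timings can be compared with a few lines. This is just a sketch for the arithmetic: the timing strings are copied from the quotes above, and parse_real is a helper name made up for this example.

```python
def parse_real(value: str) -> float:
    """Convert a `time`-style value such as '1m55.290s' to seconds."""
    minutes, seconds = value.rstrip("s").split("m")
    return int(minutes) * 60 + float(seconds)

old_time = parse_real("1m55.290s")  # v0.14
new_time = parse_real("1m24.884s")  # new version

saved = (old_time - new_time) / old_time
print(f"old: {old_time:.3f}s, new: {new_time:.3f}s")
print(f"wall-clock time saved: {saved:.1%}")  # prints 26.4%
```

So the new version shows about 12.8% less speed on the status screen but needs about 26.4% less wall-clock time for the whole run.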

Now, isn't that cool?


Messages In This Thread
New architecture, different sched - by atom - 06-12-2013, 11:03 AM