MD5 speed improvement ideas
#11
again, this is awesome! i've added it to hashcat on complete round 3. speed changed from 9.34M/s -> 9.47M/s. i think i can port it to oclHashcat, too. more infos later.
Reply
#12
works! changes on GTX465 with oclHashcat: 968.6M/s -> 983.8M/s
Reply
#13
Very nice! Difference between .23 and .24

Code:
----Version 0.23----
Threads...: 6
Mode.Left.: Mask '?l?l?l?l' (456976)
Mode.Right: Dict 'example.dict' (129988)
Speed.GPU1: 1009.4M/s (finished)
Speed.GPU2: 1008.8M/s (finished)
Speed.GPU3: 1012.6M/s (finished)
Speed.GPU4:  991.8M/s (finished)
Speed.GPU5: 1013.3M/s (finished)
Speed.GPU6: 1015.4M/s (finished)
Speed.GPU*: [b]6051.3M/s[/b]
Recovered.: 1358/6494 Digests, 0/1 Salts
Progress..: 59400876336/59401396288 (100.00%)
Running...: 10 secs

----Version 0.24----
Threads...: 6
Mode.Left.: Mask '?l?l?l?l' (456976)
Mode.Right: Dict 'example.dict' (129988)
Speed.GPU1: 1097.1M/s (finished)
Speed.GPU2: 1115.3M/s (finished)
Speed.GPU3: 1117.6M/s (finished)
Speed.GPU4: 1100.7M/s (finished)
Speed.GPU5: 1118.8M/s (finished)
Speed.GPU6: 1120.3M/s (finished)
Speed.GPU*: [b]6669.8M/s[/b]
Recovered.: 1358/6494 Digests, 0/1 Salts
Progress..: 59400876336/59401396288 (100.00%)
Running...: 10 secs
Reply
#14
OK, nice to see progress Smile, so we are attacking 10M/s, 1000M/s respectively Wink

Here goes another thing connected with instruction amount reduction... If I'm not wrong, you are doing this in oclHashcat yet. So... In every step of MD5 there is something like this:

a += ((b ^ c) ^ d) + const[n] + password[n];
a = ((a<<const) | (a>>(32-const)) ) + b;

Let's take a look at thist part : const[n] + password[n]
I think that even in the dictionary attack, most of the passwords don't exceed some lenght. For example, let's presume that most of the passwords will be shorten than 20 bytes. It means that most of the time, only password[0..4] will be changing and password[5..14] will contain only zeroes. Why shall PC still adds zero to constant?

So we can have another version of MD5 function, where for n!=0,1,2,3,4,14 will be password[n] = 0, so it can be written without unnecessary addition as:

a += ((b ^ c) ^ d) + const[n];
a = ((a<<const) | (a>>(32-const)) ) + b;

Before starting MD5, you should decide what version to use. It could be done simply by testing

if (password[5]==0) MD5_simpified();
else MD5_full;

In SSE2, you can test if the whole 128-bit password[5] variable is zero, so the result will depend on the longest of four tested passwords.

Of course, 20 chars was only my example, you can do version for different count of characters in password based upon statistics which password lenghts are dominating... (maybe 32 to fit in md5(md5(pass)) ). What do you think?
Reply
#15
added this to oclHashcat on ATI and NV. no speed improvements at all. i guess this was already optimized by compilers. hashcat supports pw length up to 55 chars and i think it should so that means that i can not apply this optimization on hashcat. thanks again! your stuff rock :-)
Reply
#16
(10-17-2010, 10:42 AM)atom Wrote: i guess this was already optimized by compilers. hashcat supports pw length up to 55 chars and i think it should so that means that i can not apply this optimization on hashcat. thanks again! your stuff rock :-)

It CAN be applied, you will still be able to compute 55 char hashes. By implementing the previous idea, you just insert ONLY one IF statement (overhead of approx. 4 instuctions) for each 4 SSE2 packed passwords to decide if full 55 char MD5 or simplified MD5 function can be executed. MD5 simplified for up to 31 chars means 28 LESS instructions, much more than previous tips! If only I can have acces to the source code Smile

Also, it can't be optimized by compiler (still speaking about CPU version), because compiler doesn't know the lenghts of computed passwords - they are known in run-time, not in compile time.

If you want something specific to GPU, here it is Smile I have examined the *.kernel files which are provided with oclHashcat and also examined document ptx_isa_1.3.pdf by nVidia to see the whole instruction set and voila, an idea...
There is MAD instruction (page 51) which performs multiplication and addition, so you can improve the bit rotation operation. Bit shifting by n is the same operation as multiplying by 2^n, so an example written in C:

Code:
    tmp = a >> 25;
    a   = a << 7;
    a   = a | tmp;
can be, of course, written also as
Code:
    tmp = a >> 25;
    a   = a * 128;
    a   = a + tmp;
Last two lines can be computed by one MAD instruction. In CUDA PTX assembly, it could looks like this...
Before optimization:
Code:
    shr.u32     %r201, %r200, 25;
    shl.b32     %r202, %r200, 7;
    or.b32      %r203, %r202, %r201;
Optimized code:
Code:
    shr.u32    %r201, %r200, 25;
    mad.u32    %r202, %r200, 128, %201
Maybe the syntax is wrong... I can't test it, because I'm not familiar with CUDA yet, I bought my first CUDA card - GTX450 just 2 weeks ago Big Grin
Reply
#17
GTS 450 = sm_21 = pain in the ass to get 100% shaders working.
But it's cheap Smile
Reply
#18
Does this apply to open compute language? Because oclhashcat is non cuda based.

Also feel free to join us in irc. You can use web interface if you don't have a client - http://rizon.net/index.php?do=chat
Reply
#19
(10-17-2010, 04:19 PM)Dalibor Wrote: There is MAD instruction (page 51) which performs multiplication and addition, so you can improve the bit rotation operation.
Unfortunately this won't work as integer MAD is just a virtual instruction. PTX isn't real assembler, so it may looks like you're optimizing some instructions but after real compilation to ISA you won't found anything good. As there even no 32-bit integer multiplication for G80-GT200 hardware MAD will be replaced by several 24-bit multiplications and additions (or even back to SHL+ADD if compiler will notice 2^n value).

And for Fermi usual rotate construction like
#define ROTATE(a,n) (((a)<<(n)) | ((a)>>(32-(n))))
will be automatically transformed into
#define ROTATE(a,n) (((a)<<(n)) + ((a)>>(32-(n))))
thus making "fused" SHL+ADD possible.

Shortly, NVCC compiler is smart enough to make rotates as fast as possible, no need for special actions from programmer.

But idea for MD5's 3rd round optimization was nice, never seen it before, thanks Smile.
Reply
#20
(10-17-2010, 08:12 PM)IvanG Wrote: Unfortunately this won't work as integer MAD is just a virtual instruction. PTX isn't real assembler...
Shortly, NVCC compiler is smart enough to make rotates as fast as possible, no need for special actions from programmer.
Thanks for explanation, Ivan. So it was false alarm, same as with BFI_INT instruction, which I have found in AMD Evergreen ISA reference guide, it can compute (A & B) | (~A & c), function bitselect() in OpenCL should be translated into this instruction... But it seems that compilers are better than I presume Smile

And now only theoretical, "academical" tip (maybe usefull for you, Ivan Smile) not applicable to hashcat.
If someone is cracking only 1 (resp. 4) hashes, the common way is to reverse hash down to step 49, then start computing hash from the beginning to step 45, then compare variable A. So we are meeting in the middle in 45th step.
But look at the constant at step 18
Code:
0xc040b340 = 11000000010000001011001101000000
There are six 0's at the end! Step 18:
Code:
d += G(a,b,c) + password[6];
d += ...01000000;
d = ((d<<9) | (d>>23))
d += a;
Last 6 bits of 'd' will not be affected by adding this constant. Then the result is rotated by 9 bits, so after rotation, we know, that bits at positions 15 - 9 (counting from zero) remains the same.
So we can avoid this addition and rotation by meeting in the middle just in this step... Hope you got the idea...

There is of course little overhead needed, but maybe it is possible to save at least one instruction by properly implementing this idea :-)
Reply