MD5 speed improvement ideas
#1
Hi Atom!

I think that you can optimize MD5 algorithm a bit. Daniel Niggebrugge (EmDebr) nor Michail Svarychevski (barsWF) didn't reply me, so i'm trying you Smile

Here comes one of the optimizations (and very little one) of sse2 code.
In steps 35, 39, 43 and 47 (round 3 od MD5 where rotating by 16 bits is performed), you can use:

a = _mm_shufflehi_epi16(a, 0xB1);
a = _mm_shufflelo_epi16(a, 0xB1);

instead of

tmp = _mm_slli_epi32(a, 16);
a = _mm_srli_epi32(a, 32-16);
a = tmp | a;

Or if you are using assembler, the above means that you can use

PSHUFHW xmm0, xmm0, 0xB1
PSHUFLW xmm0, xmm0, 0xB1

for rotating xmm0 register by 16 bits rather than

MOVDQA xmm7, xmm0
PSLLD xmm0, 16
PSRLD xmm7, 16
por xmm0, xmm7

It means you can save 2 instructions and 1 'temporary' xmm register. Round3 includes 4 these operations, so you can totally save up to 8 instructions Smile

I think that there are more improvements which could be done, but without source code of your app, it is enough for now... Wink

hope it helps, have a nice time
Dalibor
Reply
#2
Nice Dalibor, did you find this yourself or read it somewhere?
Reply
#3
It is my own little idea, but maybe i am reinventing the wheel Smile
Reply
#4
Dalibor:

It looks impressive!!
Reply
#5
man, this is nice. thanks for the explanation and thanks for placing this on our forum. you should come to our irc channel! i've added it to hashcat v0.35. here is how it changed: 9.17M/s -> 9.34M/s. thanks!
Reply
#6
Hi, it's nice to hear from you that my idea was usefull (even only a little bit, but everything counts Wink ) Thanks. Here is another tip. Very simple and 'standard'.

You can get rid of adding constant to constant at each pass... For example at first step of MD5, everything except the first part of password is known.

In the first step, there is :

a += (const ^ (const & (const ^ const))) + const + password[0];
a = ((a<<const) | (a>>const)) ) + const;

You should precompute as much as possible. So it means that you can avoid initializing a,b,c,d variables by values 0x67452301, 0xEFCDAB89 etc. then adding, xoring them with another constants etc. First four steps for example can look like this:

a = 0xd76aa477 + password[0];
a = ((a<<7) | (a>>25) ) + 0xEFCDAB89;

d = (0x98BADCFE ^ (a & 0x77777777)) + 0xf8fa0c4c + password[1];
d = ((d<<12) | (d>>20) ) + a;

c = ((d & a) | (~d & 0xEFCDAB89)) + 0xbcdb4dd9 + password[2];
c = ((c<<17) | (c>>15) ) + d;

b = ((c & d) | (~c & a)) + 0xb18b7a77 + password[3];
b = ((b<<22) | (b>>10) ) + c;

Hope I didn't do any 'typo', but I think that you got the point Wink

The same can be done from the end - hashes could be precalculated (reversed) much or less, depending on type of attack, but it is another story... Smile

Is it possible to have access to your source code?
Reply
#7
this is something i already implemented long time ago. since it gives no performance improvement i removed it (i prefer to have a clean and easy to read source). i guess the compiler already optimized it.

i reimplemented your version just to double check the result, but still no change in speed. i rewrote it to sse2 and also found that the constant 0xf8fa0c4c is wrong. the correct value is be 0xf8fa0bcc.

i also know that md5 hashes can be reversed (partially), but this optimization is incompatible to hashcat or oclHashcat since they are designed for wordlist-based attacks and not brute-force.

there are still some optimizations left in hashcat like not adding A,B,C,D to a,b,c,d at the end or not running the last 3 steps if A is not found in bitmap.

hashcat is a closed-source project. i am not willing to make it open-source yet, sorry.
Reply
#8
(10-11-2010, 12:37 PM)atom Wrote: since it gives no performance improvement i removed it (i prefer to have a clean and easy to read source). i guess the compiler already optimized it.

No change? Strange. You are right that 'clever' compiler should optimize it, but it is better to check the generated assembly Wink

I am not able to officially do that due to your licence, of course...

P.S. Sorry for bad constant...
Reply
#9
dont get me wrong. this was very constructive stuff. i hoped more threads like this would pop in :-) thanks again for your improvement tips.
Reply
#10
Hi, sorry, i don't have much time now, so briefly another tip. Again, 'clever' compiler may optimize it for you, but your code doesn't look like compiler do the job well.

First two steps of round 3:
a += ((b ^ c) ^ d) + const + pass;
a = ((a<<const) | (a>>(32-const)) ) + b;

d += (a ^ (b ^ c)) + const + pass;
d = ((d<<const) | (d>>(32-const)) ) + a;

Optimized first two steps of round 3:
tmp = b ^ c;
a += (tmp ^ d) + const + pass;
a = ((a<<const) | (a>>(32-const)) ) + b;

d += (tmp ^ a) + const + pass;
d = ((d<<const) | (d>>(32-const)) ) + a;

You can rewrite all pair steps in round 3 in that manner (i.e. computing tmp in each (2n)th step and use it in each (2n+1)th step. Just for sure, variable tmp should be declared as 'register' to avoid writing and reading 'tmp' to/from memory.

I'm not familiar with IRC yet, so maybe I'll join later. I'm using jabber for now.
Reply