10-09-2010, 08:54 PM
Hi Atom!
I think the MD5 algorithm can be optimized a bit. Neither Daniel Niggebrugge (EmDebr) nor Michail Svarychevski (barsWF) replied to me, so I'm trying you.
Here is one (very small) optimization for the SSE2 code.
In steps 35, 39, 43 and 47 (round 3 of MD5, where the rotation by 16 bits is performed), you can use:
a = _mm_shufflehi_epi16(a, 0xB1);
a = _mm_shufflelo_epi16(a, 0xB1);
instead of
tmp = _mm_slli_epi32(a, 16);
a = _mm_srli_epi32(a, 32-16);
a = _mm_or_si128(tmp, a);
Or, if you are using assembler, this means that you can use
PSHUFHW xmm0, xmm0, 0xB1
PSHUFLW xmm0, xmm0, 0xB1
to rotate each 32-bit lane of the xmm0 register by 16 bits, rather than
MOVDQA xmm7, xmm0
PSLLD xmm0, 16
PSRLD xmm7, 16
POR xmm0, xmm7
This saves 2 instructions and 1 'temporary' xmm register per rotation. Round 3 contains 4 of these operations, so you can save up to 8 instructions in total.
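If you want to convince yourself that both variants give the same result, here is a minimal sketch in C with SSE2 intrinsics. The function names (rot16_shift, rot16_shuffle, rot16_variants_agree) are made up for this example and are not taken from any of the mentioned programs. It relies on 0xB1 = 10 11 00 01 in binary, which selects source words 1, 0, 3, 2 and so swaps the two 16-bit halves of every 32-bit lane.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Baseline: rotate each 32-bit lane left by 16 with two shifts and an OR. */
static __m128i rot16_shift(__m128i a)
{
    __m128i hi = _mm_slli_epi32(a, 16);
    __m128i lo = _mm_srli_epi32(a, 32 - 16);
    return _mm_or_si128(hi, lo);
}

/* Proposed: swap the 16-bit halves of every 32-bit lane with two shuffles.
   PSHUFHW handles the upper 64 bits, PSHUFLW the lower 64 bits. */
static __m128i rot16_shuffle(__m128i a)
{
    a = _mm_shufflehi_epi16(a, 0xB1);
    a = _mm_shufflelo_epi16(a, 0xB1);
    return a;
}

/* Hypothetical helper: returns 1 if both variants match a scalar
   rotate-by-16 for all four input words, 0 otherwise. */
static int rot16_variants_agree(const uint32_t in[4])
{
    uint32_t s[4], p[4];
    __m128i v = _mm_loadu_si128((const __m128i *)in);

    _mm_storeu_si128((__m128i *)s, rot16_shift(v));
    _mm_storeu_si128((__m128i *)p, rot16_shuffle(v));

    for (int i = 0; i < 4; i++) {
        uint32_t expect = (in[i] << 16) | (in[i] >> 16);
        if (s[i] != expect || p[i] != expect)
            return 0;
    }
    return 1;
}
```

Calling rot16_variants_agree on any four 32-bit words should return 1. Whether the shuffle version is actually faster depends on the CPU (shuffle and shift units differ between microarchitectures), so it is worth benchmarking.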
I think there are more improvements that could be made, but without the source code of your app, this is enough for now...
Hope it helps, have a nice time.
Dalibor