Thanks a lot for the detailed explanation, epixoip and atom.

So, if I understand this correctly:

Whatever the size of the input data, the 64-bit representation of its length is always appended to the data.

Padding with a '1' bit followed by a sequence of 0s is only required if the length of the input data (including the 0x80 terminator) is not 448 bits, 960 bits, and so on (that is, not 448 modulo 512).

Since appending the length is mandatory, the maximum amount of data we can process in a single block is 55 bytes (64 bytes, minus 8 for the length, minus 1 for the 0x80 terminator).

For any data size greater than 55 bytes, do we need a multi-block implementation?
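
To check the 55-byte limit for myself, here is a rough sketch in C. The function names are my own and not taken from any of the code posted in this thread. It assumes that every message gets one 0x80 byte, then zero bytes, then the 8-byte length, so only the number of zero bytes varies with the input size.

```c
#include <stddef.h>

/* Number of 64-byte blocks needed for a message of len bytes,
   assuming 1 byte of 0x80 and 8 bytes of length are always appended. */
size_t md5_block_count(size_t len)
{
    return (len + 1 + 8 + 63) / 64;   /* round up to a whole block */
}

/* Number of zero bytes between the 0x80 byte and the length field. */
size_t md5_zero_pad_bytes(size_t len)
{
    return md5_block_count(len) * 64 - (len + 1 + 8);
}
```

With these, a 55-byte input fits in one block with no zero bytes at all, and a 56-byte input already needs a second block.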

You mentioned that if the size of the input data is 16 bytes (15 bytes + the 0x80 terminator), then padding is not required. Does that mean:

1. We don't need to pad with a 1 and 0s to bring the length up to 448 bits, as required by the design?

2. We don't need to append the length of the input data (16 bytes)?

Or does it mean the following:

w[0] - 1st 32-bit word of our input data

w[1..3] - 2nd, 3rd and 4th 32-bit words of the input data

The 0x80 byte will be stored in w[3].

Now we need to append a 1 bit followed by 0s until the length comes up to 448 bits.

We need 10 more 32-bit words (w[4] to w[13]) to bring the length to 448 bits.

w[4] - has the most significant bit set to 1, followed by 0s

w[5] - w[13] - all bits are set to 0

w[14] - holds the length
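
To see what the words actually look like, I put together a small sketch that builds the block for an input of up to 55 bytes. The function name is my own invention. It assumes the little-endian word layout that MD5 uses and that the 0x80 byte is itself the 1 bit followed by seven 0 bits, so I may be wrong about w[4] above: with this layout it comes out as all zeros.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Build the sixteen 32-bit message words for a single-block input.
   Assumes len <= 55 so that 0x80 and the 8-byte length still fit. */
void md5_single_block(const uint8_t *msg, size_t len, uint32_t w[16])
{
    uint8_t block[64] = {0};              /* zero padding comes for free */
    uint64_t bits = (uint64_t)len * 8;    /* length is stored in bits */

    assert(len <= 55);
    memcpy(block, msg, len);
    block[len] = 0x80;                    /* the 1 bit plus seven 0 bits */

    /* 64-bit length, little-endian, in the last 8 bytes (w[14], w[15]) */
    for (int i = 0; i < 8; i++)
        block[56 + i] = (uint8_t)(bits >> (8 * i));

    /* pack bytes into little-endian 32-bit words */
    for (int i = 0; i < 16; i++)
        w[i] = (uint32_t)block[4 * i]
             | (uint32_t)block[4 * i + 1] << 8
             | (uint32_t)block[4 * i + 2] << 16
             | (uint32_t)block[4 * i + 3] << 24;
}
```

For a 15-byte input, the 0x80 byte ends up as the top byte of w[3], w[4] to w[13] and w[15] are zero, and w[14] holds 120 (the length in bits).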

Now, from the C implementation of MD5 shown on Wikipedia (http://en.wikipedia.org/wiki/MD5):

b = b + LEFTROTATE((a + f + k[i] + w[g]), r[i]);

We save the addition of w[g] for the values of g from 5 to 13, because those words are all zero.

Each message word is used once per round, so in each round we save 9 ADD instructions.

So a total of 9 * 4 = 36 ADD instructions are saved.
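
To double-check the count, here is a small sketch that walks all 64 steps and counts how many of them read a zero message word. The helper names are mine, and the index schedule for g is the one from the Wikipedia pseudocode. A zero word only saves an ADD if the code is specialised for that layout ahead of time, so this counts the opportunities rather than measuring anything.

```c
#include <stdint.h>

/* Message word index g used by step i (0..63), per the Wikipedia pseudocode. */
static int md5_word_index(int i)
{
    if (i < 16) return i;
    if (i < 32) return (5 * i + 1) % 16;
    if (i < 48) return (3 * i + 5) % 16;
    return (7 * i) % 16;
}

/* Count the steps whose message word w[g] is zero, i.e. the steps
   where the "+ w[g]" addition could be dropped. */
int count_zero_word_steps(const uint32_t w[16])
{
    int count = 0;
    for (int i = 0; i < 64; i++)
        if (w[md5_word_index(i)] == 0)
            count++;
    return count;
}
```

With only w[5] to w[13] zero this gives 36. If w[4] and w[15] are zero as well, it gives 44.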

Have I understood this correctly?

Thanks for the md5substr_simd.c code; I will study it.
