How does 15 character limitation help in a speed up?
#1
Hi,

I know that this topic has been discussed a lot and many people have requested atom to add support in oclHashcat-plus for passwords of length greater than 15.

While all that is good, I have not seen any specifics from a cryptography point of view about how does limiting the plaintext length to 15 help in speeding it up?

I read somewhere a post by epixoip where he mentioned that one of the reasons oclHashcat-plus is so fast is because of this limitation. It does not perform all the iterations involved in the algorithm and skips a few.

So, I would like to understand this in more depth.

As an example, I have taken MD5 algorithm.

MD5 operates on blocks of data (each block having a size of 64 bytes or 512 bits). Even if input data is lesser than this size, padding will be done to bring the total length up to 512 bits.

4 Rounds are involved in MD5.

In each Round, 16 operations are involved.

Each operation in each Round is performed on the 32 bit word of the Input Data in sequence like:

Round #1: Operation 1 on the 1st 32 bit word of the Input Data.
Round #1: Operation 2 on the 2nd 32 bit word of the Input Data.

and so on ...

Similarly in the second round and the remaining rounds, Operations are once again performed in sequence on 32 bit words of the Input Data (starting once again from the first 32 bit word of Input Data)

Now, coming to oclHastcat-plus's limitation of 15 character passwords.

15 character password (considering ASCII) would mean an input data size of 15 bytes (120 bits). Please correct me if I am wrong.

So, the remaining 328 bits are padded with (10000....) so that length is 448 bits.
And the last 64 bits are filled with the 64 bit representation of the length of input message (120 bits)

so, essentially we are once again going to operate on 512 bits of data.

Are some operations in each round not calculated to help in speeding up?

Since many of the 32 bit words of the input data would consist of only 0 bits as a result of padding?

Which operations in each round are being ignored to help in speed up?

Please note that I am only trying to understand this from a cryptography point of view and not as a feature request (which has already been put here by others many times).

Any help in understand this would be great. Because to me, it looks like, even if we are restricting the length of the input plain text, the resulting MD5 is being performed on 512 bits (which is greater than the length of input data, plain text password in our case).
#2
Making assumptions about the input size allows us to do length-specific optimizations.

Using MD5 as an example, an MD5 block only holds 64 bytes of input data if you are using a multi-block implementation. If you do a single block implementation with with no multi-block support, then you can only support 55 bytes of input data (block[0]-block[13] hold the 55 bytes of input data plus the 0x80 terminator, block[14] holds the length, and block[15] is unused.) So by assuming an input of 55 bytes or less, we can gain quite a bit of speed by optimizing out all of the bits that are only used in a multi-block implementation.

But we can do better than simply not using a multi-block implementation by making further assumptions about the input size. By assuming an input size of up to 15 bytes (which becomes 16 bytes when you append the 0x80 terminator), we only need to operate on four 32-bit words. We can skip all that padding that you mention, because it is simply not needed. This allows the entire algorithm to fit nicely into registers without spilling. And since we skipped all that padding, we can ALSO optimize out 40 add instructions on that null data.

If you would like something to study, have a look at http://bindshell.nl/pub/md5substr_simd.c (one of my rare examples of public code), in which I optimize for length 8 (as there's absolutely no reason to support brute force for longer lengths on CPU.)
#3
The main optimization idea bases on the fact that we have W[5] - W[13] (in MD5) set to zero. If its zero there is no need to add the secret key to the intermediate digest because X + 0 = X. For each step in which these variables are used in we can remove the call to ADD.
#4
NeonFlash, here's an illustration of how much this optimization is good for.

Using the example code above, optimized for length 8:
Code:
epixoip@ike:$ ./md5substr_len8 babeface
Using 4 threads, 12x XOP
Elapsed: 17s  Progress: 1935998064/377149515625 (0.5%),  Speed: 117.33 M/s virt, 113.88 M/s real
babeface:>2HX2l

Same code, modified for a full single block (not optimizing out the ADDs):
Code:
epixoip@ike:$ ./md5substr_len55 babeface
Using 4 threads, 12x XOP
Elapsed: 29s  Progress: 2015997984/404567235136 (0.5%),  Speed: 74.67 M/s virt, 71.11 M/s real
babeface:>2HX2l

So by optimizing out the ADDs we gained 40%. And mind you this is on a low-end CPU (FX-4100.)
#5
Thanks a lot for the detailed explanation, epixoip and atom.

so if I understand this correct:

Whatever is the size of input data, the 64 bit representation of the length will always be appended to the data.
padding with '1' followed by a sequence of 0's is only required if the length of input data (including 0x80 terminator) is not 448 bits, 960 bits and so on.

since appending the length of the data is mandatory, so the max size of data we can process in a single block is 55 bytes.

for any data size greater than 55 bytes, we need a multi block implementation?

you mentioned that if size of input data is 16 bytes (15 bytes + 0x80 terminator), then the padding is not required?

1. we don't need to pad with 1 and 0's to bring the length upto 448 bits as required by the design?
2. we don't need to pad the length of input data (16 bytes)?

Or does it mean the following:

w[0] - 1st 32 bit word of our input data
w[1..3] - 2nd, 3rd and 4th 32 bit words of input data

0x80 byte will be stored in w[3]

Now, we need to append the bit 1 followed by 0's till length comes upto 448 bits.

we need 10 more 32 bit words to bring the length to 448 bits.

w[4] - will have the most significant bit set to 1, followed by 0s.
w[5] - w[13] - all bits are set to 0
w[14] - holds the length

now, from the C implementation of MD5 as mentioned on wiki (http://en.wikipedia.org/wiki/MD5)

b = b + LEFTROTATE((a + f + k[i] + w[g]), r[i]);

we are saving on the addition of w[g] (for the values of g from 5 to 13)

in each round, we are saving 9 ADD instructions
so a total of 9*4 = 36 ADD instructions are saved.

have I understood this correct?

Thanks for the code of md5substr_simd.c, I will study it Smile
#6
(03-10-2013, 04:42 PM)NeonFlash Wrote: Whatever is the size of input data, the 64 bit representation of the length will always be appended to the data.

32bit, there are not any 64bit words in MD5.

(03-10-2013, 04:42 PM)NeonFlash Wrote: padding with '1' followed by a sequence of 0's is only required if the length of input data (including 0x80 terminator) is not 448 bits, 960 bits and so on.

I'm not sure what you mean by this. The message is broken up in 16 32-bit words. If your message is only e.g. 128 bits in size, then the rest of the words are all 0, up to word 14. This is what we're taking advantage of.

(03-10-2013, 04:42 PM)NeonFlash Wrote: for any data size greater than 55 bytes, we need a multi block implementation?

Yes, correct.

(03-10-2013, 04:42 PM)NeonFlash Wrote: you mentioned that if size of input data is 16 bytes (15 bytes + 0x80 terminator), then the padding is not required?

1. we don't need to pad with 1 and 0's to bring the length upto 448 bits as required by the design?
2. we don't need to pad the length of input data (16 bytes)?

For any length < 55 bytes the padding is not required, because we can simply optimize out any operations on the null words. But if we optimize for length 15 and only have 8 bytes of input, then we do need to do a bit of padding (just 7 bytes.)

(03-10-2013, 04:42 PM)NeonFlash Wrote: Or does it mean the following:

w[0] - 1st 32 bit word of our input data
w[1..3] - 2nd, 3rd and 4th 32 bit words of input data

0x80 byte will be stored in w[3]

Yes. w[0] holds chars 1-4, w[1] holds 5-8, w[2] holds 9-12, w[3] holds 13-15 plus 0x80. w[14] holds the actual length of the input data.

(03-10-2013, 04:42 PM)NeonFlash Wrote: Now, we need to append the bit 1 followed by 0's till length comes upto 448 bits.


Nope, this is what we're optimizing out. We don't need to pad, and we don't need to operate on the padded data.

(03-10-2013, 04:42 PM)NeonFlash Wrote: now, from the C implementation of MD5 as mentioned on wiki (http://en.wikipedia.org/wiki/MD5)

This probably wouldn't be best implementation to study for password cracking, as this a full, un-optimized, multi-block implementation.

(03-10-2013, 04:42 PM)NeonFlash Wrote: b = b + LEFTROTATE((a + f + k[i] + w[g]), r[i]);

we are saving on the addition of w[g] (for the values of g from 5 to 13)

in each round, we are saving 9 ADD instructions
so a total of 9*4 = 36 ADD instructions are saved.

Yes, we optimize out the addition of w[g] in this example, but actually for w[4] - w[13], for a total of 40 ADDs eliminated.
#7
Sorry for my late reply. Thanks for the explanation, epixoip.

I read your code little bit in my free time to understand the optimizations done for the functions, F, G, H and I using SIMD.