An update on RAR: I finished the ATI RAR kernels yesterday. This is the most difficult algorithm I've done on GPU so far. Coding the RAR algorithm in OpenCL is not hard, but getting it to perform well is a very tough job. The scalar version stayed the fastest one almost until the very end, when I found a trick to prepare the hash blocks faster. It's all about keeping a balance: shift data to __local memory and you get lower GPR utilization, but also lower ALUBusy; vectorize more and you get higher ALUPacking, but also higher GPR utilization. I also had to write 16 separate kernels, one for each password length from 1 to 16, because there are tweaks that depend on password length. Right now the GPU does about 9000 c/s on a 6870, but because the CPU-side OpenSSL AES routines are slow, the overall speed gets capped at about 7500 c/s. For RAR archives without header encryption, AES is definitely the bottleneck and the highest speed I get is 1300 c/s.
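For readers who have not looked at the format, here is a rough CPU-side sketch of the RAR 3.x key derivation as I understand it from the public unrar sources: SHA-1 over the UTF-16LE password, the 8-byte salt and a 3-byte counter, repeated 0x40000 times, with one IV byte sampled every 0x4000 iterations. This is not the GPU kernel code. The names `rar3_kdf` and `bytes_per_iteration` are made up for illustration, and the sketch has not been checked against real archives.

```python
import hashlib

ROUNDS = 0x40000  # 262,144 iterations, as in the public unrar sources


def bytes_per_iteration(password_len: int) -> int:
    """Bytes fed to SHA-1 per iteration: UTF-16LE password + salt + counter."""
    return 2 * password_len + 8 + 3


def rar3_kdf(password: str, salt: bytes, rounds: int = ROUNDS):
    """Sketch of the RAR 3.x KDF; returns (aes_key, iv), 16 bytes each.

    `rounds` must be a positive multiple of 16. It is a parameter only so
    the function can be exercised quickly with a small value.
    """
    seed = password.encode("utf-16-le") + salt
    step = rounds // 16            # one IV byte per `step` iterations
    sha = hashlib.sha1()
    iv = bytearray()
    for i in range(rounds):
        sha.update(seed)
        sha.update(i.to_bytes(3, "little"))
        if i % step == 0:
            # Finalize a copy of the running hash and keep its last byte.
            iv.append(sha.copy().digest()[19])
    digest = sha.digest()
    # The AES-128 key is the first four 32-bit words, each byte-reversed.
    key = b"".join(digest[j:j + 4][::-1] for j in range(0, 16, 4))
    return key, bytes(iv)
```

The point for the kernels is `bytes_per_iteration`: an 8-character password adds 27 bytes per iteration, so each candidate pushes 27 * 262144 = 7,077,888 bytes through SHA-1, which is roughly 110,000 64-byte blocks. Since 27 does not divide 64, the position of the password, salt and counter inside each block keeps shifting, and the pattern is different for every password length. That is the kind of length dependency that makes per-length kernels attractive.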
The solution is probably two-fold: write a fast GPU kernel and write a fast AES-CBC routine, using SSE2, AES-NI and the like if possible. For archives with header encryption it's possible to have a second kernel that handles AES decryption on the GPU. Unfortunately this is not applicable to archives without header encryption, where the data to decrypt may be megabytes long.
I also researched the 7-Zip format. Guess what: the number of block operations is twice that of RAR, and the hash function used is SHA-256, not SHA-1. It would be about 4 times slower. Nice. I guess GPUs still don't change much as far as archive passwords are concerned (excluding ZIP, where they do help).
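The "about 4 times" figure is just the two factors multiplied. A back-of-the-envelope sketch, where the 9000 c/s and the factor of two for block operations come from the numbers above, and the assumption that one SHA-256 block costs about twice as much as one SHA-1 block is a rough guess, not a measurement (`estimate_7zip_rate` is a made-up helper name):

```python
RAR_GPU_RATE = 9000.0    # c/s on a 6870 for RAR, GPU part only
BLOCK_OPS_FACTOR = 2.0   # 7-Zip does twice as many block operations
SHA256_VS_SHA1 = 2.0     # ASSUMPTION: SHA-256 block ~2x the cost of SHA-1


def estimate_7zip_rate(rar_rate: float = RAR_GPU_RATE,
                       block_ops_factor: float = BLOCK_OPS_FACTOR,
                       hash_cost_factor: float = SHA256_VS_SHA1) -> float:
    """Scale the RAR rate down by both slowdown factors."""
    return rar_rate / (block_ops_factor * hash_cost_factor)


print(estimate_7zip_rate())  # 2250.0 c/s under these assumptions
```

So on the same card a 7-Zip kernel would land somewhere around 2000 c/s at best, before any CPU-side bottleneck is taken into account.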