[split] BFI_INT from OpenCL
#9
Tested yesterday: both perform the same as 1), on GPU and on CPU.

I find it a bit strange that 1), 4) and 5) are slower than 3) though. 3) has one bitwise operation fewer, but I would expect it to be worse because it has one more instruction dependency: the AND with b has to wait for c^d, and the final XOR with d has to wait for b&(c^d), while in 1), 4) and 5) (b&c) and ((~b)&d) can be processed in parallel. What is stranger, both behaved the same on CPU, even though more dependencies should cause a pipeline bubble. But then, I interlace several MD5 operations to keep the pipeline busy, and that is probably why I see no difference. I may try to test without interlacing, but that would require rewriting a lot of stuff just for the test.
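For reference, here is a minimal C sketch of the two formulations being compared. It assumes 1) is the textbook MD5 F function and 3) is the XOR form implied by the dependency description above. Variants 4) and 5) are left out because their exact form is not spelled out in this post. The function names and the xorshift test loop are hypothetical, for illustration only.

```c
#include <assert.h>
#include <stdint.h>

/* Variant 1) (assumed): textbook MD5 F.
   (b & c) and (~b & d) do not depend on each other. */
uint32_t f_select(uint32_t b, uint32_t c, uint32_t d) {
    return (b & c) | (~b & d);
}

/* Variant 3) (assumed): one operation fewer, but every step
   needs the previous result: c ^ d, then & b, then ^ d. */
uint32_t f_xor(uint32_t b, uint32_t c, uint32_t d) {
    return d ^ (b & (c ^ d));
}

/* Hypothetical helper: counts sampled inputs where the variants disagree. */
unsigned count_mismatches(void) {
    unsigned mismatches = 0;
    uint32_t x = 0x12345678u; /* xorshift32 state, arbitrary nonzero seed */
    for (int i = 0; i < 100000; i++) {
        uint32_t v[3];
        for (int k = 0; k < 3; k++) {
            x ^= x << 13;
            x ^= x >> 17;
            x ^= x << 5;
            v[k] = x;
        }
        if (f_select(v[0], v[1], v[2]) != f_xor(v[0], v[1], v[2])) {
            mismatches++;
        }
    }
    return mismatches;
}

/* Aborts if any sampled input distinguishes the two variants. */
void check_variants_agree(void) {
    assert(count_mismatches() == 0);
}
```

Both forms pick c where a bit of b is set and d where it is clear, so they always give the same result. Any speed difference comes only from how the compiler and hardware schedule the operations.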

I don't interlace MD5 in my GPU code though, so that behavior seems a bit irrational. Then again, I am very far from knowing the ATI GPU internals, so this probably has some simple explanation.

P.S. Another weird thing is that going from uint4 to uint8 vectors gave a strong performance boost in the GPU code, about 20%. I don't think better VLIW5 utilization can fully explain that; it's just another ATI GPU paradox I cannot explain.


Messages In This Thread
[split] BFI_INT from OpenCL - by gat3way - 12-15-2010, 12:21 AM
RE: [split] BFI_INT from OpenCL - by gat3way - 12-17-2010, 10:02 AM
RE: [split] BFI_INT from OpenCL - by Dalibor - 12-17-2010, 03:17 PM
RE: AMD Stream 2.3 SDK released - by atom - 12-15-2010, 09:09 AM
RE: AMD Stream 2.3 SDK released - by IvanG - 12-15-2010, 05:48 PM
RE: AMD Stream 2.3 SDK released - by atom - 12-15-2010, 07:33 PM
RE: AMD Stream 2.3 SDK released - by IvanG - 12-15-2010, 08:24 PM
RE: AMD Stream 2.3 SDK released - by atom - 12-15-2010, 09:22 PM
RE: AMD Stream 2.3 SDK released - by gat3way - 12-15-2010, 11:06 PM
RE: AMD Stream 2.3 SDK released - by Dalibor - 12-16-2010, 04:21 PM