12-17-2010, 03:17 PM 
		
	
	(12-17-2010, 10:02 AM)gat3way Wrote: I find it a bit strange that 1), 4) and 5) are slower than 3) though. You have one bitwise operation less, however it should be worse as there is one more instruction dependency (b depends on c^d, then d depends on b&(c^d) while in 1),4) and 5) (b&c) and ((~b)&d) can be processed in parallel). What is more strange, both behaved the same on CPU, even though more dependencies would cause a pipeline bubble.
CPU version: SSE instructions overwrite their first operand, so if you want to use variable "b" later, you can't overwrite it and you must make a copy first...
So the first case will translate into something like this:
Code:
movdqa tmp1, b
movdqa tmp2, b
pand   tmp1, c
pandn  tmp2, d
por    tmp1, tmp2
Option 3 will produce a bigger dependency chain, but one instruction less:
Code:
movdqa tmp, d
pxor   tmp, c
pand   tmp, b
pxor   tmp, d
On GPU there is a different problem: the absence of a single AND-NOT instruction, so on top of that you must compute the bitwise NOT separately.
But maybe I'm wrong, please correct me, don't have much time now...
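For anyone who wants to check that the two forms really compute the same thing, here is a rough C sketch. The function names (select_and_or, select_xor, check_select_forms) are made up for illustration, and plain 32-bit integers stand in for the SSE registers. It only shows that the results match; it says nothing about speed on any particular CPU or GPU.

```c
#include <assert.h>
#include <stdint.h>

/* Form 1: (b & c) | (~b & d). The two AND terms do not depend on each other. */
static uint32_t select_and_or(uint32_t b, uint32_t c, uint32_t d)
{
    return (b & c) | (~b & d);
}

/* Form 3: d ^ (b & (c ^ d)). One operation less, but each step needs the previous one. */
static uint32_t select_xor(uint32_t b, uint32_t c, uint32_t d)
{
    return d ^ (b & (c ^ d));
}

/* Both forms work bit by bit, so the 8 combinations of all-zeros and
   all-ones inputs cover every possible per-bit case. */
static void check_select_forms(void)
{
    const uint32_t v[2] = { 0x00000000u, 0xFFFFFFFFu };
    for (int i = 0; i < 2; i++)
        for (int j = 0; j < 2; j++)
            for (int k = 0; k < 2; k++)
                assert(select_and_or(v[i], v[j], v[k]) ==
                       select_xor(v[i], v[j], v[k]));
}
```

Calling check_select_forms() once is enough to confirm the equivalence. Where b is set, both forms take the bit from c, and where b is clear they take it from d.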
 
 

 
