12-17-2010, 03:17 PM
(12-17-2010, 10:02 AM)gat3way Wrote: I find it a bit strange that 1), 4) and 5) are slower than 3) though. You have one bitwise operation less, however it should be worse as there is one more instruction dependency (b depends on c^d, then d depends on b&(c^d) while in 1),4) and 5) (b&c) and ((~b)&d) can be processed in parallel). What is more strange, both behaved the same on CPU, even though more dependencies would cause a pipeline bubble.
CPU version: SSE2 instructions are two-operand and overwrite their destination, so if you want to use variable "b" later, you can't overwrite it and you must make a copy first...
So the first case will translate into something like this:
Code:
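movdqa tmp1, b      ; tmp1 = b
movdqa tmp2, b      ; tmp2 = b
pand   tmp1, c      ; tmp1 = b & c
pandn  tmp2, d      ; tmp2 = (~b) & d  (pandn inverts its destination operand)
por    tmp1, tmp2   ; tmp1 = (b & c) | ((~b) & d)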
Option 3 produces a longer dependency chain, but needs one instruction less:
Code:
movdqa tmp, d       ; tmp = d
pxor   tmp, c       ; tmp = c ^ d
pand   tmp, b       ; tmp = b & (c ^ d)
pxor   tmp, d       ; tmp = d ^ (b & (c ^ d))
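For reference, here is a small scalar C sketch of the two forms that the sequences above compute. The function names are made up for illustration, and it uses plain 32-bit words instead of SSE2 registers. Since every operation is bitwise, checking the 8 possible combinations of one bit of b, c and d is enough to show both forms give the same result.

```c
#include <assert.h>
#include <stdint.h>

/* First case: (b & c) | (~b & d). The two AND terms do not depend on each other. */
static uint32_t f_and_or(uint32_t b, uint32_t c, uint32_t d)
{
    return (b & c) | (~b & d);
}

/* Option 3: d ^ (b & (c ^ d)). One operation less, but each step needs the previous one. */
static uint32_t f_xor_and(uint32_t b, uint32_t c, uint32_t d)
{
    return d ^ (b & (c ^ d));
}

/* Hypothetical helper: tries all 8 combinations of a single input bit,
   replicated across the whole word, and asserts that both forms agree. */
static void check_forms_agree(void)
{
    for (unsigned i = 0; i < 8; i++) {
        uint32_t b = (i & 4u) ? 0xFFFFFFFFu : 0u;
        uint32_t c = (i & 2u) ? 0xFFFFFFFFu : 0u;
        uint32_t d = (i & 1u) ? 0xFFFFFFFFu : 0u;
        assert(f_and_or(b, c, d) == f_xor_and(b, c, d));
    }
}
```

Calling check_forms_agree() from main() should pass silently. This says nothing about speed, only that the two expressions are interchangeable.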
On GPU there is a different problem - there is no single AND-NOT instruction (the equivalent of pandn), so on top of that you must compute the bitwise NOT separately.
But maybe I'm wrong, please correct me, don't have much time now...