kernel implementation quirks between a0, a1, a3
While implementing another custom kernel, I notice that I am once again struggling with 'obvious' swaps of the pws and with the differences between the vectorized (mainly a3) and non-vectorized (a0, a1) approaches.

For example (I do not pretend to know what all the kernel OPTS_TYPE flags do under the hood, so I mostly just set them to zero when implementing a kernel):

I noticed that without OPTS_TYPE_PT_GENERATE_BE, the result of sha512_final_vector() is not consistent with the non-vectorized sha512_final(). In other words, leaving this OPTS_TYPE out changes the result of sha512_final_vector().

In the a1 and a3 kernels I noticed the pws are swapped, but in a0 they are straightforward. That means I needed an extra swap in my implementation of the a1 and a3 kernels.
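
To make the symptom concrete, here is a minimal standalone sketch in plain C (not hashcat kernel code; the helper names pack_le, pack_be and swap32 are made up for illustration). It assumes the difference comes down to how the password bytes are packed into 32-bit words: the same bytes give a different word value depending on the byte order, and one swap per word converts between the two.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Reverse the byte order of one 32-bit word. */
static uint32_t swap32(uint32_t v)
{
  return ((v & 0x000000ffu) << 24)
       | ((v & 0x0000ff00u) <<  8)
       | ((v & 0x00ff0000u) >>  8)
       | ((v & 0xff000000u) >> 24);
}

/* Hypothetical helper: pack password bytes into 16 words,
   first byte in the LOW 8 bits of each word (little-endian layout). */
static void pack_le(const uint8_t *pw, size_t len, uint32_t w[16])
{
  assert(len <= 64);
  memset(w, 0, 16 * sizeof(uint32_t));
  for (size_t i = 0; i < len; i++)
    w[i / 4] |= (uint32_t) pw[i] << (8 * (i % 4));
}

/* Hypothetical helper: same bytes, but first byte in the HIGH 8 bits
   of each word (big-endian layout, what SHA-2 style hashes consume). */
static void pack_be(const uint8_t *pw, size_t len, uint32_t w[16])
{
  assert(len <= 64);
  memset(w, 0, 16 * sizeof(uint32_t));
  for (size_t i = 0; i < len; i++)
    w[i / 4] |= (uint32_t) pw[i] << (24 - 8 * (i % 4));
}
```

For the password "abcd", pack_le() gives a first word of 0x64636261 and pack_be() gives 0x61626364; applying swap32() to one yields the other. If a kernel expects one layout and receives the other, it hashes different input, which would look like the mismatch described here.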

I can understand that some algorithms are designed around either BE or LE, and to some degree I understand that some implementation choices might be more efficient in OpenCL. I am just trying to get my head around why these differences between the kernel attack modes exist in hashcat.

If anyone can help me understand these quirks, I would appreciate it.
It's all about optimizations. Whenever possible we try to avoid swaps in the kernel, especially in fast hash kernels. For pure kernels, we need to stick to the original idea of having an openssl-like interface, otherwise it would not make sense to have it in the first place. Therefore OPTS_TYPE_PT_GENERATE_BE gets removed automatically unless you pass -O manually, which is why you need to have dedicated code for it.
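
As a rough illustration of what "dedicated code" can look like, the following plain C sketch handles both cases in one place. It is not taken from the hashcat sources: prepare_block_be() and the input_is_be parameter are hypothetical stand-ins for "the host already generated the candidate in big-endian order". When that is true the kernel can skip the swap, which is the optimization being described; when it is not, the kernel has to swap each word itself.

```c
#include <stddef.h>
#include <stdint.h>

/* Reverse the byte order of one 32-bit word. */
static uint32_t bswap32(uint32_t v)
{
  return ((v & 0x000000ffu) << 24)
       | ((v & 0x0000ff00u) <<  8)
       | ((v & 0x00ff0000u) >>  8)
       | ((v & 0xff000000u) >> 24);
}

/* Hypothetical helper (not a hashcat function): produce big-endian
   message words for a SHA-2 style transform.
   input_is_be != 0: candidate words already arrive big-endian, copy as-is.
   input_is_be == 0: candidate words arrive little-endian, swap each one. */
static void prepare_block_be(const uint32_t *in, uint32_t *out,
                             size_t n_words, int input_is_be)
{
  for (size_t i = 0; i < n_words; i++)
    out[i] = input_is_be ? in[i] : bswap32(in[i]);
}
```

Either path ends with the same words in out, so the hash transform that follows does not need to know which one was taken. The only difference is whether the per-word swap is paid inside the kernel or avoided because the data was prepared beforehand.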