#1
Can anyone explain how hashcat is able to crack scrypt on a CPU without using tons of memory, but is not able to use GPUs for the same operation?
#2
The short answer is "parallelization".

Keep in mind that it only makes sense to use GPUs because they parallelize work extremely well, i.e. the power of a GPU comes only from running all of its cores in parallel.
(In my opinion, the FAQ summarizes this very well: "Those small compute devices on GPU (shader) they are relatively slow and dumb compared to a CPU. ... What makes a GPU so fast is that there is a lot of those slow and dumb shaders. That means to make use of it, we have to parallelize the problem.", see https://hashcat.net/faq#how_to_create_mo...full_speed).

The problem with scrypt is that each scrypt computation needs a lot of memory, i.e. each compute unit ("core") needs a relatively large amount of VRAM. Being unfriendly to GPUs (and FPGAs etc.) is a central design property of scrypt (note: how much memory is needed depends heavily on the scrypt settings N, r, p).
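
To put rough numbers on this: per RFC 7914, scrypt's main working array holds N blocks of 128 * r bytes, so a single computation needs about 128 * N * r bytes. The sketch below is only back-of-the-envelope arithmetic; the function names and the example settings (N=16384, r=8, 4096 parallel work items) are assumptions for illustration, not hashcat internals.

```python
MIB = 1024 * 1024


def scrypt_bytes_per_hash(n: int, r: int) -> int:
    """Approximate size of scrypt's working array V for one computation."""
    return 128 * n * r


def total_bytes(n: int, r: int, parallel_hashes: int) -> int:
    """Memory needed if every parallel work item keeps its own full V array."""
    return scrypt_bytes_per_hash(n, r) * parallel_hashes


if __name__ == "__main__":
    # Example settings only (assumed, not taken from the thread).
    print(scrypt_bytes_per_hash(16384, 8) // MIB, "MiB per hash")
    print(total_bytes(16384, 8, 4096) // MIB, "MiB for 4096 work items")
    print(total_bytes(16384, 8, 16) // MIB, "MiB for 16 work items")
```

With these example settings one computation needs 16 MiB, so 4096 simultaneous work items would need 64 GiB, while 16 CPU threads would need only 256 MiB.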

There are other problems with scrypt that are especially noticeable on GPUs. For example, OpenCL limits how much memory can be allocated at once (especially noticeable on Nvidia, where the limit is 1/4 of VRAM); hashcat works around this limit by allocating several blocks of memory.
Furthermore, the scrypt TMTO (time-memory trade-off) setting is used to work around some memory allocation limits, but it comes with disadvantages too (e.g. if you want to use less memory, the speed drops as well).
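
The trade-off can be sketched with a simple model. It assumes a TMTO value t means that only every 2^t-th block of scrypt's working array is stored and the missing blocks are recomputed on demand. That is a common way to build such a trade-off, but the function names and the cost model here are illustrative assumptions, not hashcat's actual code.

```python
def tmto_memory_bytes(n: int, r: int, tmto: int) -> int:
    """Memory per hash when only every 2**tmto-th block of V is stored."""
    return (128 * n * r) >> tmto


def tmto_relative_work(tmto: int) -> float:
    """Estimated block-mix work relative to plain scrypt (1.0 = no TMTO).

    Simplified model: the fill loop always costs N block mixes. The lookup
    loop costs N block mixes plus, on average, (k - 1) / 2 recomputation
    steps per lookup, where k = 2**tmto. Plain scrypt costs 2 * N in total,
    so the ratio works out to 1 + (k - 1) / 4.
    """
    k = 1 << tmto
    return 1 + (k - 1) / 4


if __name__ == "__main__":
    for t in range(4):
        mib = tmto_memory_bytes(16384, 8, t) / (1024 * 1024)
        print(f"tmto={t}: {mib:g} MiB per hash, ~{tmto_relative_work(t):.2f}x work")
```

In this model every TMTO step halves the memory per hash but adds recomputation work, which is why the speed drops when less memory is used.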

There are many technical reasons why scrypt is GPU-unfriendly. GPUs getting more and more VRAM might help a little in the future, but keep in mind that scrypt has cost factors (N, r, p), so one could simply increase the cost and make it slow to crack again.

Needless to say, without parallelization (those thousands of cores we have on GPUs) cracking hashes on GPUs wouldn't be that fast. We need the parallelization; otherwise it would be faster to crack the hashes with the CPU only (which is already the case in some situations, e.g. scrypt/bcrypt with high cost factors).

So the short answer is that your OpenCL CPU device also uses a lot of RAM; it just (probably) doesn't have thousands of cores, only a few (e.g. 16 cores). Furthermore, today we might still have more allocatable RAM than VRAM (but this may change in the near future, as the trend is already towards more and more VRAM on GPUs).
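
As a rough illustration of that comparison, the sketch below estimates how many cores can actually be kept busy when each scrypt computation needs its own block of memory. The device figures (a GPU with 4096 cores and 8 GiB of VRAM, a CPU with 16 cores and 32 GiB of RAM) and the scrypt settings are made-up assumptions, not measurements.

```python
GIB = 1024 ** 3


def usable_lanes(cores: int, memory_bytes: int, bytes_per_hash: int) -> int:
    """Parallel scrypt computations that fit, capped by the core count."""
    return min(cores, memory_bytes // bytes_per_hash)


if __name__ == "__main__":
    per_hash = 128 * 16384 * 8  # 16 MiB for the assumed N=16384, r=8

    gpu = usable_lanes(cores=4096, memory_bytes=8 * GIB, bytes_per_hash=per_hash)
    cpu = usable_lanes(cores=16, memory_bytes=32 * GIB, bytes_per_hash=per_hash)

    print(f"GPU: {gpu} of 4096 cores busy")
    print(f"CPU: {cpu} of 16 cores busy")
```

Under these assumptions the CPU keeps all 16 cores busy, while the GPU can feed only 512 of its 4096 cores, so most of its parallelism goes unused.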
#3
Thanks Phil for that clear explanation! I understand now. AMD will soon be releasing the Pro SSG GPUs with 1TB of onboard memory. I will gladly make one of these available to the hashcat team if they feel this could be a game changer for improving slow hash performance.

https://arstechnica.com/gadgets/2016/07/...ease-date/
#4
I may be mistaken, but from what I can find regarding the SSG's use of onboard M.2 SSDs, it does not seem like they will be usable as VRAM. Instead, it sounds like the onboard SSD will be used as a highly accessible cache or swap area, where the host can load large datasets and the GPU can then interact with that dataset easily and quickly, removing the bottlenecks involved in talking to the host. This would be more like Intel Optane and less like real VRAM/RAM. The core itself would still have its own VRAM; it would just be able to rapidly swap data in and out of the SSD to store large things or collect the next work units without having to ask the host board directly. Interesting nonetheless, just not quite the same as having 1TB of VRAM.

"Ultimately the trick for application developers is directly streaming resources from the SSDs treating it as a level of cache between the DRAM and system storage." - http://www.anandtech.com/show/10518/amd-...ds-onboard