mask attack (-a 3): the candidates are generated and used directly on the GPU, within the GPU kernel code
attacks involving dictionaries (-a 0, -a 1, -a 6, -a 7): there is a disk I/O bottleneck AND the passwords need to be "sent"/transferred from host RAM ("CPU") to device VRAM ("GPU") (PCIe "bottleneck")
there is an exception for slow hashes that only have init/loop/comp kernel functions: for some attacks the password candidates are not generated and used on-the-fly within the same GPU kernel code, because the "slow hashing algorithm bottleneck" dominates (there is no need for it and it makes no performance difference, because of the cost factor/iterations etc.)
there are ways to speed things up: -w 3, -O (if applicable) and a good number of rules in an -a 0 attack (rule-based attack, see https://hashcat.net/wiki/doku.php?id=rule_based_attack)
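
a rough way to see why rules help with a dictionary attack: the host only has to deliver each base word once, and the rules are then applied on the GPU, so every transferred word turns into many candidates. The sketch below is a toy model, not a benchmark: the function name effective_rate and all the numbers are made up for illustration, and real speeds depend on the hash mode, the device and the wordlist.

```python
def effective_rate(gpu_rate, host_feed_rate, amplifier=1):
    """Candidates per second actually tested.

    gpu_rate       -- hashes/s the kernel could do if it were never starved
    host_feed_rate -- base words/s the host can read from disk and push over PCIe
    amplifier      -- candidates produced on the GPU per base word
                      (e.g. the number of rules in an -a 0 attack)
    """
    return min(gpu_rate, host_feed_rate * amplifier)


# Hypothetical figures, chosen only to make the effect visible.
GPU_RATE = 10_000_000_000   # fast hash, kernel-limited speed
HOST_FEED = 50_000_000      # words/s delivered by disk + PCIe

plain = effective_rate(GPU_RATE, HOST_FEED)             # -a 0 without rules
with_rules = effective_rate(GPU_RATE, HOST_FEED, 64)    # -a 0 with 64 rules
saturated = effective_rate(GPU_RATE, HOST_FEED, 1000)   # -a 0 with 1000 rules

for label, rate in [("no rules", plain),
                    ("64 rules", with_rules),
                    ("1000 rules", saturated)]:
    print(f"{label:>10}: {rate / GPU_RATE:.1%} of what the GPU could do")
```

in this model the GPU stays mostly idle without rules, and the more rules are applied per word, the closer the attack gets to the kernel-limited speed. For slow hashes the kernel-limited speed is so low that the host feed is not the limiting factor to begin with.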