08-04-2021, 07:54 PM
After searching for a few days, I think I've found something.
Maxwell and Pascal GPUs use a different implementation with separate shared memory but combined L1 cache and texture cache.
Turing’s SM introduces a new unified architecture for shared memory, L1, and texture caching. This unified design allows the L1 cache to leverage resources, increasing its hit bandwidth by 2x per TPC compared to Pascal, and allows it to be reconfigured to grow larger when shared memory allocations are not using all the shared memory capacity. The Turing L1 can be as large as 64 KB in size, combined with a 32 KB per SM shared memory allocation, or it can reduce to 32 KB, allowing 64 KB of allocation to be used for shared memory.
Maybe the secp256k1 computation is cache sensitive. I tried putting the secp256k1_t struct into shared memory, and the point_mul_xy calculation seems to be about twice as fast as before. I am a newbie at CUDA development, but maybe I have found the reason?
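To show what I mean, here is a minimal sketch of the idea, not the exact kernel code: each block copies the precomputed table once into __shared__ memory and then calls point_mul_xy with the shared copy instead of a global pointer. The struct layout (xy array of 32-bit words), the PRE_COMPUTED_WORDS size, and the point_mul_xy signature are assumptions for illustration; point_mul_xy itself is assumed to already exist in the kernel source.

[code]
// Assumed layout: the precomputed basepoint multiples as an array of u32 words.
#define PRE_COMPUTED_WORDS 96   // placeholder size, adjust to the real table

typedef struct secp256k1
{
  unsigned int xy[PRE_COMPUTED_WORDS];
} secp256k1_t;

// Assumed to be the existing device function in the kernel source.
__device__ void point_mul_xy (unsigned int *x, unsigned int *y,
                              const unsigned int *k,
                              const secp256k1_t *tmps);

__global__ void point_mul_kernel (const secp256k1_t *g_precomp,
                                  const unsigned int *keys,
                                  unsigned int *out_x, unsigned int *out_y)
{
  // One shared copy of the table per block instead of every thread
  // pulling it through L1/L2 from global memory.
  __shared__ secp256k1_t s_precomp;

  // Cooperative copy: each thread stages a slice of the table.
  for (int i = threadIdx.x; i < PRE_COMPUTED_WORDS; i += blockDim.x)
  {
    s_precomp.xy[i] = g_precomp->xy[i];
  }

  __syncthreads (); // table must be complete before any thread reads it

  const int gid = blockIdx.x * blockDim.x + threadIdx.x;

  unsigned int x[8];
  unsigned int y[8];

  // Scalar is 8 x u32 = 256 bits per work item (layout assumed).
  point_mul_xy (x, y, keys + gid * 8, &s_precomp);

  for (int i = 0; i < 8; i++)
  {
    out_x[gid * 8 + i] = x[i];
    out_y[gid * 8 + i] = y[i];
  }
}
[/code]

The point of the pattern is just that the read-only table stays on-chip for the whole block, which matters more on Maxwell/Pascal (separate shared memory, combined L1/texture cache) than on Turing's unified design quoted above.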