NVIDIA GPU Fermi Memory Hierarchy
Local storage
- Each thread has its own local storage
- Mostly registers (managed by the compiler)
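To see how much per-thread storage the compiler actually assigned, one option is to ask the CUDA runtime. The sketch below is only an illustration: the kernel `axpy` and the helper `axpyRegisterCount` are made-up names, not anything from this post. It uses `cudaFuncGetAttributes`, which reports the registers per thread and any per-thread "local memory" bytes (typically register spills).

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Hypothetical kernel: i and xi are per-thread scalars that the compiler
// will normally keep in registers.
__global__ void axpy(float a, const float *x, float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // private to this thread
    if (i < n) {
        float xi = x[i];                            // per-thread temporary
        y[i] = a * xi + y[i];
    }
}

// Returns the registers per thread that the compiler assigned to axpy,
// or -1 if the query fails (for example, when no CUDA device is present).
// If localBytes is non-null, it receives the per-thread local memory size.
int axpyRegisterCount(size_t *localBytes)
{
    cudaFuncAttributes attr;
    if (cudaFuncGetAttributes(&attr, axpy) != cudaSuccess) return -1;
    if (localBytes) *localBytes = attr.localSizeBytes;
    return attr.numRegs;
}
```

The exact numbers depend on the compiler version and target architecture, so treat them as something to inspect rather than something to rely on.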
Shared memory / L1
- Configurable by the program: the 64 KB per SM is split as 16 KB shared / 48 KB L1 or 48 KB shared / 16 KB L1
- Shared memory is accessible by all threads in the same thread block
- Very low latency
- Very high throughput: 1+ TB/s aggregate
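A minimal sketch of both points above: choosing the shared / L1 split and sharing data between threads of one block. The names `reverseChunks`, `reverseOnDevice` and the block size `BLOCK` are assumptions made for this example. `cudaFuncSetCacheConfig` with `cudaFuncCachePreferShared` only states a preference for the larger shared memory split; hardware without a configurable split may ignore it.

```cuda
#include <cuda_runtime.h>

#define BLOCK 256  // assumed block size for this sketch

// Each block reverses its own BLOCK-element chunk in place via shared memory.
// Assumes the array length is a multiple of BLOCK and blockDim.x == BLOCK.
__global__ void reverseChunks(int *data)
{
    __shared__ int tile[BLOCK];            // visible to this thread block only
    int base = blockIdx.x * BLOCK;
    int t = threadIdx.x;
    tile[t] = data[base + t];              // stage the chunk in shared memory
    __syncthreads();                       // wait until the whole tile is loaded
    data[base + t] = tile[BLOCK - 1 - t];  // read a value written by another thread
}

// Host helper: prefer the 48 KB shared / 16 KB L1 split, then launch.
// Returns the first CUDA error encountered, or cudaSuccess.
cudaError_t reverseOnDevice(int *d_data, int n)
{
    cudaError_t err = cudaFuncSetCacheConfig(reverseChunks, cudaFuncCachePreferShared);
    if (err != cudaSuccess) return err;
    reverseChunks<<<n / BLOCK, BLOCK>>>(d_data);
    return cudaDeviceSynchronize();
}
```

The `__syncthreads()` call matters here: without it, a thread could read a tile slot before the thread responsible for it has written it.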
L2
- All accesses to global memory go through L2, including copies to/from CPU host
Global memory
- Accessible by all threads as well as host (CPU)
- Higher latency (400-800 cycles)
- Throughput: up to 177 GB/s
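To get a feel for global memory throughput on a particular card, a simple copy kernel timed with CUDA events is a common starting point. This is a rough sketch, not a careful benchmark; `copyKernel` and `measureCopyGBps` are names invented for the example. It counts one read and one write per element.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Coalesced copy: consecutive threads touch consecutive global addresses.
__global__ void copyKernel(const float *in, float *out, size_t n)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Returns the measured copy throughput in GB/s, or a negative value on error.
// Keep n <= 65535 * 256 on Fermi, where the grid x dimension is limited to 65535.
double measureCopyGBps(size_t n)
{
    float *in = nullptr, *out = nullptr;
    if (cudaMalloc(&in, n * sizeof(float)) != cudaSuccess) return -1.0;
    if (cudaMalloc(&out, n * sizeof(float)) != cudaSuccess) { cudaFree(in); return -1.0; }
    cudaMemset(in, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    const int threads = 256;
    const unsigned blocks = (unsigned)((n + threads - 1) / threads);

    copyKernel<<<blocks, threads>>>(in, out, n);  // warm-up launch, not timed
    cudaEventRecord(start);
    copyKernel<<<blocks, threads>>>(in, out, n);  // timed launch
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaError_t err = cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(in);
    cudaFree(out);

    if (err != cudaSuccess || ms <= 0.0f) return -1.0;
    // bytes moved / (ms * 1e-3 s) / 1e9 = bytes / (ms * 1e6)
    return (2.0 * (double)n * sizeof(float)) / ((double)ms * 1.0e6);
}
```

Expect the measured figure to land below the theoretical peak; a single timed launch is also noisy, so averaging several runs gives a steadier number.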
If shared memory / L1 is used properly, so that data is staged on chip and reused instead of being re-read from global memory, the speedup can be much greater.
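As an illustration of that kind of reuse, here is a sketch of a 1D stencil in which each block loads its inputs from global memory once into a shared tile, and every thread then reads its neighbours from the tile. The names `stencilShared`, `runStencil`, `TILE` and `RADIUS` are assumptions for this example, and out-of-range neighbours are treated as zero.

```cuda
#include <cuda_runtime.h>

#define TILE 256   // assumed threads per block
#define RADIUS 3   // assumed stencil radius

// out[i] = sum of in[i-RADIUS .. i+RADIUS], with out-of-range inputs read as 0.
// Must be launched with blockDim.x == TILE.
__global__ void stencilShared(const float *in, float *out, int n)
{
    __shared__ float tile[TILE + 2 * RADIUS];
    int g = blockIdx.x * TILE + threadIdx.x;   // global index
    int l = threadIdx.x + RADIUS;              // index inside the tile

    tile[l] = (g < n) ? in[g] : 0.0f;          // interior element
    if (threadIdx.x < RADIUS) {                // first RADIUS threads load the halos
        int left = g - RADIUS;
        int right = g + TILE;
        tile[l - RADIUS] = (left >= 0 && left < n) ? in[left] : 0.0f;
        tile[l + TILE] = (right < n) ? in[right] : 0.0f;
    }
    __syncthreads();                           // tile is complete past this point

    if (g < n) {
        float sum = 0.0f;
        for (int k = -RADIUS; k <= RADIUS; ++k)
            sum += tile[l + k];                // neighbours come from shared memory
        out[g] = sum;
    }
}

// Host helper: launch over n elements and wait for completion.
cudaError_t runStencil(const float *d_in, float *d_out, int n)
{
    int blocks = (n + TILE - 1) / TILE;
    stencilShared<<<blocks, TILE>>>(d_in, d_out, n);
    return cudaDeviceSynchronize();
}
```

With this layout each input is fetched from global memory roughly once per block rather than once per neighbouring thread. How much that helps in practice depends on the kernel, and on Fermi the L1 cache already absorbs part of the repeated reads, so it is worth timing both versions.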