Monday, November 14, 2011

NVIDIA GPU Fermi Memory Hierarchy Review

NVIDIA GPU Fermi Memory Hierarchy

Local storage

  • Each thread has its own local storage
  • Mostly registers (managed by the compiler)

Shared memory / L1

  • Configurable per program (64 KB per SM in total): 16 KB shared / 48 KB L1 or 48 KB shared / 16 KB L1
  • Shared memory is accessible by all threads in the same thread block
  • Very low latency
  • Very high throughput: 1+ TB/s aggregate
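
As a minimal sketch of the points above (not code from the original post), a kernel can stage data in shared memory, and the host can ask the runtime to prefer the 48 KB shared / 16 KB L1 split with cudaFuncSetCacheConfig. The kernel name reverse_chunks, the helper run_reverse_chunks, and the block size of 256 are illustrative assumptions. The cache preference is only a hint, which the runtime may ignore on hardware without a configurable split.

```cuda
#include <cuda_runtime.h>

#define BLOCK 256  // threads per block (illustrative choice)

// Reverses each BLOCK-sized chunk of `data` in place.
// The chunk is staged in shared memory, which is visible to every
// thread of the same thread block but to no other block.
__global__ void reverse_chunks(int *data)
{
    __shared__ int tile[BLOCK];
    int t = threadIdx.x;
    int base = blockIdx.x * BLOCK;

    tile[t] = data[base + t];
    __syncthreads();                      // wait until the whole chunk is staged
    data[base + t] = tile[BLOCK - 1 - t];
}

// Hypothetical host helper: n must be a multiple of BLOCK.
cudaError_t run_reverse_chunks(int *host, int n)
{
    int *dev = NULL;
    size_t bytes = (size_t)n * sizeof(int);
    cudaError_t err = cudaMalloc(&dev, bytes);
    if (err != cudaSuccess) return err;

    // Hint: prefer 48 KB shared / 16 KB L1 for this kernel.
    cudaFuncSetCacheConfig(reverse_chunks, cudaFuncCachePreferShared);

    err = cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        reverse_chunks<<<n / BLOCK, BLOCK>>>(dev);
        err = cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost);
    }
    cudaFree(dev);
    return err;
}
```

Passing cudaFuncCachePreferL1 instead requests the opposite split (16 KB shared / 48 KB L1), which tends to suit kernels that use little shared memory.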


L2 cache

  • All accesses to global memory go through L2, including copies to/from the CPU host

Global memory

  • Accessible by all threads as well as the host (CPU)
  • Higher latency (400-800 cycles)
  • Throughput: up to 177 GB/s
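
The throughput figure can be sanity-checked with simple arithmetic: peak bandwidth = 2 (double data rate) x memory clock x bus width in bytes. The sketch below assumes a 1.85 GHz memory clock and a 384-bit bus; that is my assumption about the board behind the 177 GB/s number, since the post does not name one. The function names are hypothetical. The second helper reads the same two values from the runtime via cudaDeviceGetAttribute.

```cuda
#include <cuda_runtime.h>

// Theoretical peak bandwidth in GB/s.
// mem_clock_khz: memory clock in kHz; bus_width_bits: memory bus width in bits.
// The factor 2 accounts for double-data-rate memory.
double peak_bandwidth_gbs(int mem_clock_khz, int bus_width_bits)
{
    return 2.0 * mem_clock_khz * 1e3 * (bus_width_bits / 8.0) / 1e9;
}

// Same calculation from values reported by the runtime.
// Returns a negative value if a query fails (e.g. no CUDA device present).
double device_peak_bandwidth_gbs(int device)
{
    int clock_khz = 0, bus_bits = 0;
    if (cudaDeviceGetAttribute(&clock_khz, cudaDevAttrMemoryClockRate, device) != cudaSuccess)
        return -1.0;
    if (cudaDeviceGetAttribute(&bus_bits, cudaDevAttrGlobalMemoryBusWidth, device) != cudaSuccess)
        return -1.0;
    return peak_bandwidth_gbs(clock_khz, bus_bits);
}
```

With the assumed values, peak_bandwidth_gbs(1850000, 384) works out to 177.6 GB/s, consistent with the "up to 177 GB/s" above. Bandwidth measured in a real kernel is typically lower than this theoretical peak.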

If shared memory / L1 is used well, the speedup can be much greater, because data served from on-chip memory avoids the 400-800 cycle latency of global memory.
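
One way to see where such a speedup comes from is data reuse. The sketch below (my own example, not from the original post) is a 1D stencil that sums each element with its RADIUS neighbors on either side. Each block reads every input it needs from global memory only once, into shared memory; after that, each value is reused up to 2*RADIUS + 1 times at shared-memory latency. The names stencil_1d and run_stencil, RADIUS = 3, and BLOCK = 256 are assumptions chosen for illustration, and the actual gain depends on the kernel and the hardware.

```cuda
#include <cuda_runtime.h>

#define RADIUS 3    // neighbors on each side (illustrative)
#define BLOCK  256  // threads per block; the kernel assumes blockDim.x == BLOCK

// out[i] = sum of in[i-RADIUS .. i+RADIUS], treating out-of-range inputs as 0.
__global__ void stencil_1d(const int *in, int *out, int n)
{
    __shared__ int tile[BLOCK + 2 * RADIUS];
    int g = blockIdx.x * blockDim.x + threadIdx.x;  // global index
    int l = threadIdx.x + RADIUS;                   // index inside the tile

    // Every thread stages its own element.
    tile[l] = (g < n) ? in[g] : 0;

    // The first RADIUS threads also stage the left and right halo.
    if (threadIdx.x < RADIUS) {
        int left  = g - RADIUS;
        int right = g + BLOCK;
        tile[l - RADIUS] = (left >= 0 && left < n) ? in[left] : 0;
        tile[l + BLOCK]  = (right < n) ? in[right] : 0;
    }
    __syncthreads();  // the tile must be complete before anyone reads it

    if (g < n) {
        int sum = 0;
        for (int k = -RADIUS; k <= RADIUS; ++k)
            sum += tile[l + k];  // served from shared memory
        out[g] = sum;
    }
}

// Hypothetical host helper: copies in, runs the kernel, copies the result back.
cudaError_t run_stencil(const int *h_in, int *h_out, int n)
{
    int *d_in = NULL, *d_out = NULL;
    size_t bytes = (size_t)n * sizeof(int);
    cudaError_t err = cudaMalloc(&d_in, bytes);
    if (err != cudaSuccess) return err;
    err = cudaMalloc(&d_out, bytes);
    if (err != cudaSuccess) { cudaFree(d_in); return err; }

    err = cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        int blocks = (n + BLOCK - 1) / BLOCK;
        stencil_1d<<<blocks, BLOCK>>>(d_in, d_out, n);
        err = cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);
    }
    cudaFree(d_in);
    cudaFree(d_out);
    return err;
}
```

A version that reads in[] directly inside the loop would issue 2*RADIUS + 1 global loads per output element; on Fermi some of those would hit in L1/L2, but the shared-memory tile makes the reuse explicit rather than leaving it to the cache.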
