High Frequency Trading: NVIDIA GPU Fermi Memory Hierarchy Review

Monday, November 14, 2011

NVIDIA GPU Fermi Memory Hierarchy

Local storage

Each thread has its own local storage
Mostly registers (managed by the compiler)

Shared memory / L1

Program-configurable split of 64 KB per multiprocessor: 16 KB shared / 48 KB L1, or 48 KB shared / 16 KB L1

Shared memory is accessible by all threads in the same thread block

Very low latency

Very high throughput: 1+ TB/s aggregate
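
As a rough illustration of the points above, here is a minimal sketch of a per-block sum that stages data in shared memory and asks the runtime for the larger shared-memory split. The names block_sum, sum_on_gpu and kBlock are hypothetical, the block size of 256 is an arbitrary choice, and cudaFuncSetCacheConfig is only a preference hint that the runtime may ignore on hardware without a configurable split.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

constexpr int kBlock = 256;  // threads per block (assumed; must be a power of two)

// Each block copies its slice of the input into shared memory,
// then reduces it to a single partial sum with a tree reduction.
__global__ void block_sum(const float* in, float* out, int n) {
    __shared__ float tile[kBlock];            // visible to this thread block only
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    tile[tid] = (i < n) ? in[i] : 0.0f;       // one global-memory read per thread
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) tile[tid] += tile[tid + s];  // all further traffic stays on-chip
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = tile[0];
}

// Host helper: sums the vector on the GPU and adds the per-block partials on the CPU.
float sum_on_gpu(const std::vector<float>& h_in) {
    const int n = static_cast<int>(h_in.size());
    if (n == 0) return 0.0f;
    const int blocks = (n + kBlock - 1) / kBlock;

    // Hint: prefer the 48 KB shared / 16 KB L1 configuration for this kernel.
    cudaFuncSetCacheConfig(block_sum, cudaFuncCachePreferShared);

    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, blocks * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    block_sum<<<blocks, kBlock>>>(d_in, d_out, n);

    std::vector<float> partial(blocks);
    cudaError_t err = cudaMemcpy(partial.data(), d_out, blocks * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    assert(err == cudaSuccess);  // also surfaces kernel launch or execution errors
    cudaFree(d_in);
    cudaFree(d_out);

    float total = 0.0f;
    for (float p : partial) total += p;
    return total;
}
```

Calling sum_on_gpu(std::vector<float>(1000, 1.0f)) from a host main should give 1000. Passing cudaFuncCachePreferL1 instead would request the 16 KB shared / 48 KB L1 split.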

L2

All accesses to global memory go through L2, including copies to/from CPU host

Global memory

Accessible by all threads as well as host (CPU)
Higher latency (400-800 cycles)
Throughput: up to 177 GB/s
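
The throughput figure above can be related to the memory clock and bus width. The sketch below assumes double-data-rate memory, so peak bandwidth is 2 x memory clock x bus width in bytes. The function names are hypothetical, and the example values in the usage note (1,848,000 kHz and a 384-bit bus, which I believe match a GeForce GTX 480) are an assumption, not something stated in this post.

```cuda
#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

// Theoretical peak bandwidth in GB/s, assuming double-data-rate memory:
// 2 transfers per clock * clock rate in Hz * bus width in bytes.
double peak_bandwidth_gbs(int mem_clock_khz, int bus_width_bits) {
    assert(mem_clock_khz > 0 && bus_width_bits > 0);
    return 2.0 * mem_clock_khz * 1e3 * (bus_width_bits / 8.0) / 1e9;
}

// Prints a few memory-hierarchy limits for one device.
// Returns false if any attribute query fails (for example, no CUDA device present).
bool print_memory_hierarchy(int device) {
    int regs = 0, smem = 0, l2 = 0, clock_khz = 0, bus_bits = 0;
    if (cudaDeviceGetAttribute(&regs, cudaDevAttrMaxRegistersPerBlock, device) != cudaSuccess ||
        cudaDeviceGetAttribute(&smem, cudaDevAttrMaxSharedMemoryPerBlock, device) != cudaSuccess ||
        cudaDeviceGetAttribute(&l2, cudaDevAttrL2CacheSize, device) != cudaSuccess ||
        cudaDeviceGetAttribute(&clock_khz, cudaDevAttrMemoryClockRate, device) != cudaSuccess ||
        cudaDeviceGetAttribute(&bus_bits, cudaDevAttrGlobalMemoryBusWidth, device) != cudaSuccess) {
        return false;
    }
    if (clock_khz <= 0 || bus_bits <= 0) return false;  // attribute not reported
    std::printf("registers per block:     %d\n", regs);
    std::printf("shared memory per block: %d bytes\n", smem);
    std::printf("L2 cache:                %d bytes\n", l2);
    std::printf("peak global bandwidth:   %.1f GB/s\n",
                peak_bandwidth_gbs(clock_khz, bus_bits));
    return true;
}
```

With the assumed values, peak_bandwidth_gbs(1848000, 384) works out to about 177.4 GB/s, in line with the "up to 177 GB/s" quoted above. Measured throughput is normally lower than this theoretical peak.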

If shared memory / L1 is used properly, the speedup can be much greater: data is then served on-chip at 1+ TB/s aggregate instead of from global memory at up to 177 GB/s.
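
One common way to use shared memory "properly" is to load data that several threads need once per block and reuse it from on-chip storage. The sketch below is a 1D stencil (each output is the sum of an element and its neighbours within a radius of 3), where every input element is read from global memory about once per block instead of seven times. The names stencil_1d, stencil_on_gpu, kRadius and kBlockSize are hypothetical, and out-of-range neighbours are treated as zero by assumption.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

constexpr int kRadius = 3;       // neighbours on each side (assumed)
constexpr int kBlockSize = 256;  // threads per block (assumed; launch must match)

__global__ void stencil_1d(const int* in, int* out, int n) {
    // Tile holds the block's elements plus a halo of kRadius on each side.
    __shared__ int tile[kBlockSize + 2 * kRadius];
    int g = blockIdx.x * blockDim.x + threadIdx.x;  // global index
    int l = threadIdx.x + kRadius;                  // index inside the tile

    tile[l] = (g < n) ? in[g] : 0;
    if (threadIdx.x < kRadius) {
        int left = g - kRadius;
        int right = g + blockDim.x;
        tile[l - kRadius] = (left >= 0 && left < n) ? in[left] : 0;  // left halo
        tile[l + blockDim.x] = (right < n) ? in[right] : 0;          // right halo
    }
    __syncthreads();  // the whole tile must be loaded before anyone reads it

    if (g < n) {
        int acc = 0;
        for (int k = -kRadius; k <= kRadius; ++k) acc += tile[l + k];  // on-chip reads
        out[g] = acc;
    }
}

// Host helper: runs the stencil on the GPU and returns the result.
std::vector<int> stencil_on_gpu(const std::vector<int>& h_in) {
    const int n = static_cast<int>(h_in.size());
    std::vector<int> h_out(n);
    if (n == 0) return h_out;
    const int blocks = (n + kBlockSize - 1) / kBlockSize;

    int *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, n * sizeof(int));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    stencil_1d<<<blocks, kBlockSize>>>(d_in, d_out, n);

    cudaError_t err = cudaMemcpy(h_out.data(), d_out, n * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    assert(err == cudaSuccess);  // also surfaces kernel launch or execution errors
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

For an input of all ones, interior outputs are 7 and the first and last outputs are 4, since three of their neighbours fall outside the array. How much faster this is than reading every neighbour from global memory depends on the device and on how well L1/L2 already cache the accesses, so it is worth measuring rather than assuming.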
