Thursday, December 8, 2011

Better utilize CPU L1 Cache & L2 Cache for Performance Tuning

To better utilize CPU L1 Cache & L2 Cache for Performance Tuning, we need to understand a few important points:
1. Cache and RAM: how large the caches are and why modern CPUs need them. The key fact is that the CPU is much faster than main memory, so the performance bottleneck in current systems is usually memory access and bus speed rather than computation.
[Figure: CPU Cache & RAM Architecture]
2. Cache Miss - A cache miss is a failed attempt to read or write a piece of data in the cache, which results in a main memory access with much longer latency. There are three kinds of cache misses: instruction read miss, data read miss, and data write miss. For details, see http://en.wikipedia.org/wiki/CPU_cache#Cache_miss
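To make the cost of data read misses concrete, here is a minimal sketch in C. It is not taken from the report; the matrix size and the function names (`fill_matrix`, `sum_row_major`, `sum_col_major`) are made up for illustration. Both functions compute the same sum, but the row-major loop walks adjacent addresses and so reuses each fetched cache line, while the column-major loop jumps a full row between accesses and is likely to miss far more often on typical hardware.

```c
#include <assert.h>
#include <stddef.h>

enum { ROWS = 512, COLS = 512 }; /* hypothetical sizes, chosen for illustration */

/* Fill the matrix with small known values so both traversals can be compared. */
void fill_matrix(int m[ROWS][COLS]) {
    assert(m != NULL);
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            m[i][j] = (i + j) % 7;
}

/* Row-major traversal: consecutive accesses touch adjacent addresses,
   so every element of a fetched cache line is used before moving on. */
long sum_row_major(int m[ROWS][COLS]) {
    assert(m != NULL);
    long total = 0;
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            total += m[i][j];
    return total;
}

/* Column-major traversal: consecutive accesses are COLS * sizeof(int)
   bytes apart, so each access tends to land on a different cache line. */
long sum_col_major(int m[ROWS][COLS]) {
    assert(m != NULL);
    long total = 0;
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            total += m[i][j];
    return total;
}
```

Timing the two functions on a matrix larger than the L2 cache is a simple way to see the effect; the actual gap depends on the CPU, the cache sizes, and the compiler's optimizations.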

A research report, "Optimization Study for Multicores" by Muneeb Anwar Khan, shows how the cache can be better utilized to improve system performance and thus reduce latency. Note that Acumem is the profiling tool he used to identify the problematic code.

One simple example: look at the source code:
[Figure: original source code with the two problem statements highlighted]


 Problem 1:
The report shows that problem 1 accounts for 30.8% of all the memory fetches in this application, with a fetch utilization of 43.1% in the first highlighted statement. The instruction stats show its misses to be 34% of all the cache misses in the application, with the fetch ratio and miss ratio both at 21.2%. Reducing the fetch and miss ratios would greatly help with the bandwidth issues.

 Problem 2:
The report points out poor fetch utilization for the second highlighted statement. Its miss ratio and fetch ratio are identical at 15.4%, and its fetch utilization is extremely poor at only 12.8%.
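Fetch utilization is the fraction of each fetched cache line that the code actually uses. The sketch below is one hypothetical way low utilization can arise and be fixed; it is not the code from the report. It assumes a 64-byte cache line, and `struct record`, `sum_aos`, `sum_soa` and `record_utilization` are illustrative names. When a loop reads only one 8-byte field of a 64-byte record, it uses 8/64 = 12.5% of every line it pulls in; packing the hot field into its own array lets every fetched byte be used.

```c
#include <assert.h>
#include <stddef.h>

enum { N = 1024 }; /* hypothetical element count */

/* Hypothetical record: one hot field plus 56 bytes the loop never reads.
   With an assumed 64-byte cache line, each record fills a whole line. */
struct record {
    double value;     /* the only field the loop uses */
    double unused[7]; /* cold data sharing the same cache line */
};

/* Fraction of each fetched record that the loop actually reads. */
double record_utilization(void) {
    return (double)sizeof(double) / (double)sizeof(struct record);
}

/* Array-of-structs: every iteration fetches a full record for 8 useful bytes. */
double sum_aos(const struct record *r, int n) {
    assert(r != NULL && n >= 0);
    double total = 0.0;
    for (int i = 0; i < n; i++)
        total += r[i].value;
    return total;
}

/* Hot field stored contiguously: each fetched line holds 8 useful values. */
double sum_soa(const double *values, int n) {
    assert(values != NULL && n >= 0);
    double total = 0.0;
    for (int i = 0; i < n; i++)
        total += values[i];
    return total;
}
```

The 12.5% here is just the arithmetic of this made-up layout, not a measurement from the report. The general point is that separating data the loop uses from data it does not raises utilization and cuts the number of lines fetched.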

Let us see the revised code, based on the two problems identified:
[Figure: revised source code]

What is the performance improvement? With just a few simple modifications that stop unused data from being pulled into the cache, the speedup is about 2.9X.

By making better use of CPU caches, low-latency, high-frequency trading platforms can reduce latency further within the CPU host itself.
