Thursday, December 29, 2011

Strategies to find suitable holes for memory segment allocation request

Modern operating systems use several strategies to find a suitable memory hole for a segment allocation request from a program.

There are normally four of them:

  • First Fit
  • Best Fit
  • Quick Fit 
  • Buddy System

The Buddy System is one of the smartest and most efficient of these strategies and is widely used. The same logic may also apply to the design of our trading systems, to achieve the best efficiency and thus the lowest possible latency.

Buddy System - Memory Hole Searching Strategy



In the buddy system, memory is always allocated and deallocated in sizes that are a power of 2. A request for a segment allocation is rounded up to the nearest power of 2 that is greater than or equal to the requested amount. The memory manager maintains n lists of holes, n >= 1. List(i), i = 0, ..., n-1, holds all holes of size 2 to the power of i.

A hole may be removed from List(i) and split into two holes of size 2 to the power of i-1 (called 'buddies' of size 2 to the power of i-1, see picture above), and the two holes are entered in List(i-1). Conversely, a pair of buddies of size 2 to the power of i may be removed from List(i) and coalesced into a single larger hole, and the new hole of size 2 to the power of i+1 is entered in List(i+1).

To allocate a hole of size 2 to the power of i, the search starts at List(i). If the list is not empty, a hole from the list is allocated. Otherwise, get a hole of size 2 to the power of i+1 from List(i+1) (splitting a larger hole in the same way if that list is empty too), split the hole into two, put one half in List(i), and allocate the other half.

Hole deallocation is done in reverse fashion: to free a hole of size 2 to the power of i, put it in List(i); if its buddy is already there, remove both, coalesce them, and insert the coalesced hole in List(i+1). This insertion may in turn cause the coalescing of two buddies, their removal from List(i+1), a new insertion in List(i+2), and so on.
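The allocation and coalescing steps above can be sketched in C++ roughly as follows. This is only a minimal illustration, not how any particular OS implements it: the class name BuddyAllocator, the use of abstract unit offsets instead of real addresses, and std::set as the container for each List(i) are all assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Minimal buddy-system sketch over a pool of 2^max_order abstract units.
// free_lists_[i] plays the role of List(i): offsets of free holes of size 2^i.
class BuddyAllocator {
public:
    explicit BuddyAllocator(unsigned max_order)
        : max_order_(max_order), free_lists_(max_order + 1) {
        assert(max_order < 31);            // keep the shifts below well-defined
        free_lists_[max_order].insert(0);  // one hole covering the whole pool
    }

    // Smallest i such that 2^i >= size (the "round up to a power of 2" step).
    static unsigned order_for(std::size_t size) {
        unsigned order = 0;
        while ((std::size_t{1} << order) < size) ++order;
        return order;
    }

    // Returns the offset of a hole of size 2^order, or -1 if none is available.
    long allocate(unsigned order) {
        if (order > max_order_) return -1;
        if (!free_lists_[order].empty()) {
            const long offset = *free_lists_[order].begin();
            free_lists_[order].erase(free_lists_[order].begin());
            return offset;
        }
        // List(order) is empty: take a hole from List(order + 1) and split it.
        const long bigger = allocate(order + 1);
        if (bigger < 0) return -1;
        free_lists_[order].insert(bigger + (1L << order));  // upper buddy stays free
        return bigger;                                      // lower buddy is handed out
    }

    // Frees a hole of size 2^order, coalescing with its buddy while possible.
    void release(long offset, unsigned order) {
        while (order < max_order_) {
            const long buddy = offset ^ (1L << order);  // buddies differ only in this bit
            const auto it = free_lists_[order].find(buddy);
            if (it == free_lists_[order].end()) break;  // buddy is in use: stop merging
            free_lists_[order].erase(it);
            if (buddy < offset) offset = buddy;         // merged hole starts at the lower offset
            ++order;
        }
        free_lists_[order].insert(offset);
    }

    std::size_t free_count(unsigned order) const { return free_lists_[order].size(); }

private:
    unsigned max_order_;
    std::vector<std::set<long>> free_lists_;
};
```

With a pool of 16 units, allocating order 2 twice splits the pool down to 4-unit holes, and releasing both holes merges everything back into a single 16-unit hole. The XOR trick for locating a buddy only works because every hole of size 2^i starts at a multiple of 2^i.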

Monday, December 26, 2011

Wednesday, December 14, 2011

Pros and cons of disabling C-STATE (and C1E)

For the BIOS to have full control of all the features of the newer cpus, they all need to be enabled.
Maybe this will help (taken from another post):
That was the case for older CPUs, but the i3/i5/i7 benefit from both. SpeedStep is better for changing the multiplier/voltage, but C-State has additional benefits on the new Intel CPUs: instead of the whole CPU being either on, off or idle, parts of the CPU can now be turned on or off or set to idle, and this works in conjunction with Intel's Turbo Mode.
So basically they did do the same job, but there are benefits to having both on when it comes to the new i3/i5/i7 CPUs.
You will want to set CxE Function to C6 to get these new benefits alongside having SpeedStep enabled. They can work independently of each other, but it is best to have both enabled. Be warned, though: with newer EVGA BIOSes, having CxE Function enabled will allow the higher Turbo Mode multipliers to kick in and could make your OC unstable. If this is the case, disable CxE Function; you could keep SpeedStep enabled if it still works. On the X58 Classified, the voltage part of SpeedStep does not work with a manually entered voltage; it does, however, still work on the E758 3X SLI board with a manually entered vCore voltage. This is just due to the components used and how the boards are set up for their market segments, the Classified being primarily an overclocking board where power-saving features are secondary. There are still workarounds for the X58 Classified using the ECP; this should allow you to OC the CPU but use an AUTO voltage, which would allow the voltage part of SpeedStep to work.
More details can be found at http://www.techsupportforum.com/forums/f15/pros-and-cons-of-disabling-c-state-and-c1e-559253.html

Monday, December 12, 2011

Understand OS Scheduling for better system performance

Modern general-purpose operating systems, which favor interactive workloads over real-time ones, normally adopt a Round-Robin scheduling strategy, which allocates CPU resources to active processes more effectively than FCFS (First Come, First Served).
Round Robin Scheduling

For example, the picture below shows how the CPU is allocated to processes. The response times for P1, P2, P3, P4, and P5 are 30, 24, 42, 14, and 18 time units, respectively. The average response time is 25.6 time units, which is better than the 28.4 time units for FCFS scheduling. Nevertheless, RR scheduling leads to more context switches.
Process CPU Allocation
Please also note that the OS applies so-called Fair Share Scheduling among users and groups. To let a high-priority process gain more CPU time slices, it is better to dedicate a user to running only this process. To make the most of Round-Robin scheduling, the time slice needs to be defined carefully so that the core logic of your process can finish within one time slice; it is then never scheduled out to wait for its next slice. In this way, your high frequency trading solution can run more efficiently and use more CPU power to finish its tasks.
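The comparison between FCFS and Round-Robin can be reproduced with a small simulation. The sketch below makes several simplifying assumptions: all processes arrive at time 0, context switches cost nothing, and "response time" is taken to mean the time at which a process completes. The function names and the burst times used in the note after the code are hypothetical; the burst times behind the figures quoted in the post are not given there.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <queue>
#include <vector>

// Completion time of each process under FCFS, all arriving at time 0.
std::vector<int> fcfs_completion(const std::vector<int>& burst) {
    std::vector<int> done(burst.size(), 0);
    int clock = 0;
    for (std::size_t i = 0; i < burst.size(); ++i) {
        clock += burst[i];
        done[i] = clock;
    }
    return done;
}

// Completion time of each process under Round-Robin with a fixed time slice.
std::vector<int> rr_completion(const std::vector<int>& burst, int quantum) {
    assert(quantum > 0);
    std::vector<int> remaining(burst);
    std::vector<int> done(burst.size(), 0);
    std::queue<std::size_t> ready;
    for (std::size_t i = 0; i < burst.size(); ++i) {
        if (burst[i] > 0) ready.push(i);
    }
    int clock = 0;
    while (!ready.empty()) {
        const std::size_t p = ready.front();
        ready.pop();
        const int run = std::min(remaining[p], quantum);
        clock += run;
        remaining[p] -= run;
        if (remaining[p] > 0) {
            ready.push(p);    // not finished: back to the end of the queue
        } else {
            done[p] = clock;  // finished within this slice
        }
    }
    return done;
}

// Arithmetic mean of the completion times.
double average(const std::vector<int>& times) {
    assert(!times.empty());
    return std::accumulate(times.begin(), times.end(), 0.0) /
           static_cast<double>(times.size());
}
```

For hypothetical bursts of 3, 2 and 1 time units and a quantum of 1, Round-Robin completes the processes at times 6, 5 and 3, while FCFS completes them at 3, 5 and 6. With a quantum of 3, every burst fits into a single slice, no process is ever scheduled out, and Round-Robin behaves exactly like FCFS, which is the situation the advice about sizing the time slice aims for.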

Thursday, December 8, 2011

Better utilize CPU L1 Cache & L2 Cache for Performance Tuning

To make better use of the CPU L1 and L2 caches for performance tuning, we need to understand a few important points:
1. Cache and RAM: how large the caches are and why modern CPUs need them. The CPU is much faster than memory, so the performance bottleneck in current systems is memory access and bus speed.
CPU Cache & RAM Architecture
2. Cache Miss - A cache miss refers to a failed attempt to read or write a piece of data in the cache, which results in a main memory access with much longer latency. There are three kinds of cache misses: instruction read miss, data read miss, and data write miss. For details, see http://en.wikipedia.org/wiki/CPU_cache#Cache_miss

A research report, "Optimization Study for Multicores" by Muneeb Anwar Khan, shows how the cache can be better utilized to achieve better system performance and thus reduce latency. Please note that Acumem is the profiling tool he used to identify the problematic code.

One simple example: look at the source code:
Better utilize CPU L1 Cache & L2 Cache for Performance Tuning


 Problem 1:
The report shows problem 1’s fetches to be 30.8% of all the memory fetches in this application, with a fetch utilization of 43.1% in the first highlighted statement. The instruction stats show the misses to be 34% of all the cache misses in the application, and fetch and miss ratio at 21.2%. Reducing the fetch and miss ratio would greatly help improve bandwidth issues.

 Problem 2:
The report points out poor fetch utilization for the second highlighted statement. Having an identical miss and fetch ratio of 15.4%, it has an extremely poor fetch utilization of only 12.8%.

Let us look at the revised code, based on the two problems identified:
Better utilize CPU L1 Cache & L2 Cache for Performance Tuning

What is the performance improvement? With just a few simple modifications that stop unused data from being pulled into the cache, the speedup is about 2.9X.
Better utilize CPU L1 Cache & L2 Cache for Performance Tuning

By making better use of the CPU caches, a high frequency trading platform can reach even lower latency within the CPU host itself.

Friday, December 2, 2011

SSD Read/Write Performance

SSD Read/Write Performance
Information from http://www.tomshardware.com/reviews/best-ssd-price-per-gb-ssd-performance,2942.html

One random read from an SSD takes about 20 to 50 microseconds. Hopefully it will become much faster in the near future, so that the CPU and OS can read from and write to an SSD as if it were slightly slower RAM.

Wednesday, November 23, 2011

FastFlow - the new multithreading framework with Atomic Queues

In a high frequency trading solution, lock-free queues can help to increase the speed of message transfer between threads while also reducing performance jitter.

When we talk about so-called lock-free mechanisms, an atomic queue should be the first choice we consider.
However, please note that atomic operations are not free of synchronization; otherwise there would be no data safety between threads. Instead of taking an OS-level lock, an atomic operation relies on the CPU's guarantee that an access to a word-sized memory location, which is typically the size of an integer, completes indivisibly.

There are many sample atomic queue implementations on the web. We found FastFlow to be the best one: it directly provides atomic queues as well as a multi-threading framework that helps to enforce a proper software architecture design.

Better still, FastFlow is a free, open source project, which is actively maintained and tested by a group of users.
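To make the idea concrete, here is a minimal single-producer/single-consumer ring buffer built on std::atomic. It is a generic sketch, not FastFlow's queue or its API: the class name SpscQueue, the fixed capacity and the chosen memory orderings are assumptions for illustration, and the queue is only safe when exactly one thread pushes and exactly one other thread pops.

```cpp
#include <array>
#include <atomic>
#include <cstddef>

// Single-producer / single-consumer ring buffer.
// One slot is always left empty, so it holds at most Capacity - 1 items.
template <typename T, std::size_t Capacity>
class SpscQueue {
    static_assert(Capacity >= 2, "need at least one usable slot");

public:
    // Called by the producer thread only. Returns false if the queue is full.
    bool push(const T& value) {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        const std::size_t next = (tail + 1) % Capacity;
        if (next == head_.load(std::memory_order_acquire)) return false;  // full
        buffer_[tail] = value;
        tail_.store(next, std::memory_order_release);  // publish the filled slot
        return true;
    }

    // Called by the consumer thread only. Returns false if the queue is empty.
    bool pop(T& out) {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        if (head == tail_.load(std::memory_order_acquire)) return false;  // empty
        out = buffer_[head];
        head_.store((head + 1) % Capacity, std::memory_order_release);  // free the slot
        return true;
    }

private:
    std::array<T, Capacity> buffer_{};
    std::atomic<std::size_t> head_{0};  // next slot to read, written by the consumer
    std::atomic<std::size_t> tail_{0};  // next slot to write, written by the producer
};
```

Neither thread ever blocks: each index is written by only one side, and the release/acquire pair on that index is what makes the slot contents visible to the other thread. A production queue would usually also pad head_ and tail_ onto separate cache lines to avoid false sharing.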
FastFlow architecture design

FastFlow Features


For more details, see its SourceForge page: https://sourceforge.net/projects/mc-fastflow/

Thursday, November 17, 2011

Performance of Memory Copy between Host and Device with NVIDIA cards

Performance of NVIDIA Geforce 8600
  • CudaMemcpyHostToDev=18 microseconds
  • CudaMemcpyDevToHost=23 microseconds 

Performance of NVIDIA GTX 580
  • CudaMemcpyHostToDev=6 microseconds
  • CudaMemcpyDevToHost=8 microseconds

As we can see, newer NVIDIA GPGPU cards such as the GTX 580, with the Fermi memory architecture, are much faster at these copies than older NVIDIA cards such as the GeForce 8600.

Monday, November 14, 2011

NVIDIA GPU Fermi Memory Hierarchy Review

NVIDIA GPU Fermi Memory Hierarchy

Local storage

  • Each thread has own local storage
  • Mostly registers (managed by the compiler)

Shared memory / L1

  • Program configurable: 16KB shared / 48 KB L1 OR 48KB shared / 16KB L1
  • Shared memory is accessible by the threads in the same threadblock
  • Very low latency
  • Very high throughput: 1+ TB/s aggregate

L2

  • All accesses to global memory go through L2, including copies to/from CPU host

Global memory
  • Accessible by all threads as well as host (CPU)
  • Higher latency (400-800 cycles)
  • Throughput: up to 177 GB/s


If shared memory / L1 is used properly, the speedup can be much greater.

High-frequency trading: Good, bad or just different? - Technology - Futures Magazine

High-frequency trading: Good, bad or just different? - Technology - Futures Magazine:
Mike O’Hara has interviewed scores of traders, connectivity providers, academics and exchange operators for his web site, the High Frequency Trading Review. He always opens his interviews with the same question: "What is high-frequency trading (HFT)?"
He never gets the same answer twice.
"The problem is that ‘high-frequency’ is a relative term," says O’Hara, a former floor trader at the London International Financial Futures Exchange (Liffe). "There are, however, some common threads in all definitions: It’s computer-driven; it generates a large number of orders in a short space of time; it’s dependent on low-latency, fast access to execution venues; its positions are held for short periods of time; it ends the day flat and it’s characterized by a high order-to-trade ratio." ...

Deep CUDA Training, Basics / Advanced

NVIDIA and NOVATTE provide their customers with free CUDA training in Singapore, which is a really smart marketing strategy. The training provides clarifications and optimization tricks and is quite informative.

Below is the certificate I received.
Deep CUDA Training, Basics / Advanced

Friday, November 11, 2011

Nerdblog.com: How slow is dynamic_cast?

Nerdblog.com: How slow is dynamic_cast?: C++ users are advised frequently not to use dynamic_cast, because it's slow. I thought it would be nice to measure this, so I made up a tes...

how do you compare FPGA and GPU?

We use the GPU in our option pricing module. Its calculation speed compared to the CPU is about 20%; however, its parallel calculation power helps us to reduce overall latency when the underlying price changes.

The latency cost of a memory copy between GPU and CPU is a bit high, about 30 - 50 microseconds.

I feel an FPGA will most likely achieve the same performance if we use it the same way as the GPU. In the FPGA case, it might be more efficient in power consumption.

An FPGA should be able to achieve better performance if it bypasses unnecessary OS communication and performs parallel calculation in physical logic blocks, which a CPU cannot do.

10GBPS NIC and Optical Fiber for High Frequency Trading


The 10 gigabit Ethernet (10GE or 10GbE or 10 GigE) computer networking standard was first published in 2002. It defines a version of Ethernet with a nominal data rate of 10 Gbit/s (billion bits per second), ten times faster than gigabit Ethernet.
10 gigabit Ethernet defines only full duplex point to point links which are generally connected by network switches. Half duplex operation, hubs and CSMA/CD (carrier sense multiple access with collision detection) do not exist in 10GbE.
The 10 gigabit Ethernet standard encompasses a number of different physical layer (PHY) standards. A networking device may support different PHY types through pluggable PHY modules, such as those based on SFP+. Over time market forces will determine the most popular 10GE PHY types.[1]
more can be read from http://en.wikipedia.org/wiki/10_Gigabit_Ethernet
10GBPS NIC cards

High frequency traders need to select their network cards carefully as well. A 10Gbps NIC can clearly outperform 1Gbps cards. However, remember that a 10Gbps NIC needs a 10Gbps switch and optical fiber to form a complete solution; otherwise the speed gain will be limited by whichever of the three is the bottleneck.
10GBPS NIC Adaptors

Currently Dell and other OEM PC/server manufacturers provide offload features in their network cards to save the CPU power spent on socket data processing. 10Gbps Ethernet cards provide further offload capabilities, with built-in FPGA chips that bypass the CPU and the TCP/IP stack as much as possible. For example, SolarFlare provides OpenOnload drivers to boost performance, and their products have been deployed at many exchanges and HFT firms.

As a typical performance figure, network latency with 1Gbps cards is around 50 to 70 microseconds, while a 10Gbps NIC from SolarFlare can reach around 5 microseconds.

Thursday, November 10, 2011

Recommend: book The Intelligent Investor by Benjamin Graham

The Intelligent Investor by Benjamin Graham

Before we continue to talk about other aspects of high frequency trading, I would like to share one of the investment books I have read, The Intelligent Investor by Graham, which explains how to follow a simple principle for value investing in stock markets. To summarize, it is all about patience, discipline and consistency. With those, you can at least beat the market.

As one of the developers working for high frequency traders, I know that we must fear them and avoid competing with them. I strongly suggest that if you trade in stock markets personally, you stop short-term trading unless all of the following conditions are met:

  • your algorithm does not compete directly with other professional traders
  • your algorithm does not depend on the performance of your system
  • your algorithm works with an amount of cash small enough that it will not impact market movements


I myself trade on the Singapore Stock Exchange and the China Stock Exchange, following the direction of The Intelligent Investor. Luckily, neither exchange has sophisticated high frequency trading, so I can still run some simple spread strategies manually on top of value investing. Other reasons I combine these simple strategies with value investing are that, as a full-time software engineer, I do not have the luxury of time to follow the market closely, and I don't like the idea of staring at market data every day. We need to spend time enjoying our lives, right? Money is important, but it is not everything.

Monday, November 7, 2011

User authority for operating system real-time capabilities

Low-latency-sensitive applications will try to pin all their memory and use (through chrt or sched_setscheduler()) real-time priorities for application threads.

To enable this capability, create a group with these special privileges and add the required users to this group. Typical values for this privileged group can be set in a configuration file as follows:

TIMEOUT=unlimited # in minutes
RT_PRIORITY=100
NICE=-20
MEMLOCK=unlimited
cat >>/etc/security/limits.conf <<EOF
@realtime soft cpu $TIMEOUT
@realtime - rtprio $RT_PRIORITY
@realtime - nice $NICE
@realtime - memlock $MEMLOCK
EOF

Users can then be added to the realtime group with usermod (-a -G appends the group to the user's supplementary groups instead of replacing the primary group):
usermod -a -G realtime <userid>

Users must sign out and then sign back into the system for these changes to take effect.

For more information about the limits.conf file, see the limits.conf man page.
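On the application side, the two capabilities mentioned above (real-time priority and pinned memory) are requested through the Linux/POSIX calls sched_setscheduler() and mlockall(). The sketch below assumes a Linux system; the wrapper names request_fifo_priority and lock_all_memory are made up for this example. Both calls fail without the rtprio and memlock limits (or root privileges), so the wrappers simply report success or failure.

```cpp
#include <cassert>
#include <sched.h>
#include <sys/mman.h>

// Asks the kernel to run the calling process under SCHED_FIFO at `priority`.
// Returns false if the priority is outside the valid range or the kernel
// refuses the request, for example because the rtprio limit was not granted.
bool request_fifo_priority(int priority) {
    const int lo = sched_get_priority_min(SCHED_FIFO);
    const int hi = sched_get_priority_max(SCHED_FIFO);
    assert(lo != -1 && hi != -1);  // -1 would mean the policy is unsupported
    if (priority < lo || priority > hi) return false;
    sched_param param{};
    param.sched_priority = priority;
    return sched_setscheduler(0, SCHED_FIFO, &param) == 0;  // 0 = calling process
}

// Pins all current and future pages of the process in RAM so that they are
// never paged out. Needs a sufficient memlock limit.
bool lock_all_memory() {
    return mlockall(MCL_CURRENT | MCL_FUTURE) == 0;
}
```

A typical start-up sequence would call lock_all_memory() once and then request_fifo_priority() with a value inside the range reported by sched_get_priority_min() and sched_get_priority_max(), logging a warning rather than aborting if either call returns false.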

Sunday, November 6, 2011

Sandy Bridge and Overclocking

Some high frequency traders use the latest Sandy Bridge CPUs and overclocking to increase CPU power. With these techniques, the CPU clock frequency can be increased to 4.5GHz. One thing to note: if not all of your processes or threads are busy-waiting, do not enable CPU power saving in your BIOS settings; otherwise thread switching will be slower when some threads need to wake up.

Sandy Bridge or Ivy Bridge can again increase CPU power by another 10% to 20% compared to the previous CPU generation.
Intel Sandy Bridge

Sandy Bridge
Up to 17% more CPU performance clock-for-clock compared to Lynnfield processors.[20]
Around twice the integrated graphics performance compared to Clarkdale's (12 EUs comparison).

Ivy Bridge
Intel's performance targets (compared to Sandy Bridge):[21]
20% increase in CPU performance.
Up to 60% increase in integrated graphics performance.[22]

CPU Specification Comparison

Generation     Socket     Cores        Transistor count   Die size
Sandy Bridge   LGA 1155   4            995 Million[23]    216 mm²
Sandy Bridge   LGA 1155   2 (6 EUs)    504 Million        131 mm²
Sandy Bridge   LGA 1155   2 (12 EUs)   624 Million        149 mm²
Sandy Bridge   LGA 2011   4/6/8        -                  -
Ivy Bridge     LGA 1155   4            1.4 Billion[24]    ~172 mm²[25]
Ivy Bridge     LGA 2011   -            -                  -


Overclocking Cooling

More can be read at http://en.wikipedia.org/wiki/Sandy_Bridge for Sandy Bridge and http://en.wikipedia.org/wiki/Overclocking for overclocking.

Thursday, November 3, 2011

What is really needed by a high frequency trader

High-Frequency Trading

What Does High-Frequency Trading - HFT Mean?
A program trading platform that uses powerful computers to transact a large number of orders at very fast speeds. High-frequency trading uses complex algorithms to analyze multiple markets and execute orders based on market conditions. Typically, the traders with the fastest execution speeds will be more profitable than traders with slower execution speeds. As of 2009, it is estimated more than 50% of exchange volume comes from high-frequency trading orders.

Please note the sentence highlighted in bold. High frequency traders normally do not care about your system's absolute latency figure; they just want to be sure that your system is faster than their competitors' systems.
In other words, your system must take less time from the moment a market tick is received to the moment the order leaves for the exchange. Remember, the key is not how fast your system can be; the key is that your system should be faster than the others.
High-Frequency Trading

For similar strategies that exploit arbitrage opportunities, the institution with the faster system will grab all the market opportunities and leave nothing for its competitors. It is not a game in which all market participants can share the opportunities. According to one statistics report, the number of high frequency trading firms in India dropped from 300+ to about 30 in just one year. Whether a firm survives is a matter of the speed of the system it uses.
High-Frequency Trading

Once the latency war starts, it will end in a situation where all systems that survive in the market run at almost the same speed, limited by the technologies currently available. For high frequency trading, the exchange's support is equally important. Some exchanges, like those in China, actually limit the possibilities for high frequency trading strategies by putting restrictions in their gateway libraries.

Monday, October 31, 2011

using SSD for high frequency trading

BiTMICRO made a number of introductions and announcements in 1999 around flash-based SSDs including an 18 GB 3.5 in SSD.[19] Fusion-io announced a PCIe-based SSD with 100,000 input/output operations per second (IOPS) of performance in a single card with capacities up to 320 gigabytes in 2007.[20] At Cebit 2009, OCZ demonstrated a 1 terabyte (TB) flash SSD using a PCI Express ×8 interface. It achieves a maximum write speed of 654 megabytes per second (MB/s) and maximum read speed of 712 MB/s.[21] In December 2009, Micron Technology announced the world's first SSD using a 6 gigabits per second (Gbit/s) or 600 (MB/s) SATA interface.[22]
PCI attached IO Accelerator SSD


from http://en.wikipedia.org/wiki/Solid-state_drive

We can use an SSD to replace a normal HDD for better performance wherever reading and writing IO is involved.
Please note that an SSD can achieve roughly 100,000 reads/writes per second, which is about 10 microseconds per operation, provided the data size involved is less than about 7KB (a throughput of 712MB/s divided by 100,000 operations per second gives roughly 7KB per operation). This speed is still not comparable to reading from and writing to DDR SDRAM. And Samsung has developed the first DDR4 RAM, which is even faster.

about High-frequency trading

I have been working on a high frequency trading platform for the last 3 years. Currently most of our competitors talk about round trip latency within their own systems, that is, from market tick received to order leaving for the exchange, in the range of single-digit microseconds.

Currently the fastest round trip time I have heard of is achieved with FPGA technology, which has a latency of only 2 microseconds.

Below is a rough description of the speeds involved in high frequency trading, but do not be fooled by the latency numbers provided. We can have further discussion or idea sharing in my following posts. A basic guideline is to make your market tick to order path as short as possible. Outside the software itself, different combinations of OS, network cards, routers and switches will have a huge impact on system performance. An FPGA board has the advantage of bypassing the OS completely.

High-frequency trading now revolves around microseconds and even nanoseconds. Picoseconds are on the horizon.
WHAT’S IN A SECOND?
    1 millisecond (ms) = one thousandth of a second
    1 microsecond (us) = one millionth of a second
    1 nanosecond (ns) = one billionth of a second
    1 picosecond (ps) = one-trillionth of a second
A fast trader can type and submit perhaps five trades in a minute, said Paul Michaud, a trading and risk management specialist at the software group of International Business Machines Corp <IBM.N> in Houston. In those 60 seconds, exchange systems and black boxes will soon be able to transmit 60 million trades, he said.
“Generally people view we’re in a race to zero here. I mean literally we’re in a race to zero. Speed of light is actually an issue for a lot of our clients,” Michaud said.  
“An electrical signal can travel down a wire 200 meters in one microsecond,” said Greg Allen, vice president of governance, architecture and planning at TMX Group Inc <X.TO>, parent of the Toronto Stock Exchange.
“A blink of the eye is about 200 milliseconds,” Allen said. “The fastest exchange or ATS (alternative trading system) would’ve been in the range of 5 milliseconds,” referring to trading venues built up to five years ago. “Now, the best ones claim to be around 500 microseconds - so half a millisecond.”

Saturday, October 8, 2011

Stack vs Heap Allocation

Quote:
Stack vs Heap Allocation

We conclude our discussion of storage class and scope by briefly describing how the memory of the computer is organized for a running program. When a program is loaded into memory, it is organized into three areas of memory, called segments: the text segment, stack segment, and heap segment. The text segment (sometimes also called the code segment) is where the compiled code of the program itself resides. This is the machine language representation of the program steps to be carried out, including all functions making up the program, both user defined and system.

The remaining two areas of system memory is where storage may be allocated by the compiler for data storage. The stack is where memory is allocated for automatic variables within functions. A stack is a Last In First Out (LIFO) storage device where new storage is allocated and deallocated at only one ``end'', called the Top of the stack.

When a program begins executing in the function main(), space is allocated on the stack for all variables declared within main(). If main() calls a function, func1(), additional storage is allocated for the variables in func1() at the top of the stack. Notice that the parameters passed by main() to func1() are also stored on the stack. If func1() were to call any additional functions, storage would be allocated at the new Top of stack as seen in the figure. When func1() returns, storage for its local variables is deallocated, and the Top of the stack returns to its previous position. If main() were to call another function, storage would be allocated for that function at the Top shown in the figure. As can be seen, the memory allocated in the stack area is used and reused during program execution. It should be clear that memory allocated in this area will contain garbage values left over from previous usage.

The heap segment provides more stable storage of data for a program; memory allocated in the heap remains in existence for the duration of a program. Therefore, global variables (storage class external), and static variables are allocated on the heap. The memory allocated in the heap area, if initialized to zero at program start, remains zero until the program makes use of it. Thus, the heap area need not contain garbage.
In case you forget the difference between the heap and the stack.

C++ General: How is floating point representated?

C++ General: How is floating point representated?
http://www.codeguru.com/forum/showthread.php?t=323835


another very interesting article talking about the Integer Security

http://www.codeguru.com/cpp/sample_chapter/article.php/c11111

 

C++ High Performance Dynamic Typing

 
One is boost::any.
another is from http://www.codeproject.com/KB/cpp/dynamic_typing.aspx

both can map any type of variable into its "any" type

Fast Memory Management

For example, we have an array of 1 million records, and each record is a structure.

How can we quickly identify an empty slot for use?

We can use the same trick Oracle used: we use every bit of an integer to store the status flag of one element of the array.

Normally one unsigned integer is 32 bits, so for one million records we need 1000000 / 32 = 31250 integers in total. With a 64-bit unsigned long, we can divide that by 2 again (15625).

When we look for an empty slot, we check whether the first integer's value is 4294967295 (UINT_MAX for a 32-bit unsigned integer, i.e. all bits set).
If it is less than UINT_MAX, we then search this unsigned integer bit by bit to find a bit with value zero.
If it is equal to UINT_MAX, we move on to the next integer.

Searching 31250 unsigned integers is still quite a lot, so maybe we can find a more efficient method.

Once we have found an empty slot and used it, we should carry its index along whenever the record's information has to be passed among processes via a message queue or other means; otherwise, searching such a big array could be a painful experience.
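The bitmap search described above might look like this in C++. This is a minimal sketch under stated assumptions: the class name SlotBitmap and its member functions are invented for the example, a set bit means "slot in use", and the inner search is a plain bit-by-bit loop rather than a compiler intrinsic.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One status bit per record: 1 = slot in use, 0 = slot free.
class SlotBitmap {
public:
    static constexpr std::size_t npos = static_cast<std::size_t>(-1);

    explicit SlotBitmap(std::size_t slots)
        : slots_(slots), words_((slots + 31) / 32, 0u) {}

    // Finds the first free slot, marks it used and returns its index,
    // or npos if every slot is taken.
    std::size_t acquire() {
        for (std::size_t w = 0; w < words_.size(); ++w) {
            if (words_[w] == UINT32_MAX) continue;  // all 32 slots taken: skip the word
            for (unsigned bit = 0; bit < 32; ++bit) {
                const std::size_t index = w * 32 + bit;
                if (index >= slots_) return npos;   // only padding bits remain
                const std::uint32_t mask = std::uint32_t{1} << bit;
                if ((words_[w] & mask) == 0) {
                    words_[w] |= mask;
                    return index;
                }
            }
        }
        return npos;
    }

    // Marks a slot as free again; the caller passes the index it kept.
    void release(std::size_t index) {
        assert(index < slots_);
        words_[index / 32] &= ~(std::uint32_t{1} << (index % 32));
    }

    std::size_t word_count() const { return words_.size(); }

private:
    std::size_t slots_;
    std::vector<std::uint32_t> words_;
};
```

For one million records the bitmap needs 31250 words, matching the arithmetic above. Because release() takes the slot index directly, freeing a record is a single bit operation as long as the index travels with the record.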

How to represent floating number in binary mode?

The answer to the OP's question can be found in the C standard (to which the C++ standard delegates) and is reasoned on the following *symmetric* floating-point model, described in (C99) 5.2.4.2.2:
x = s b^e Sum[f_k b^(-k), {k, 1, p}]
where
x: Any floating-point number
s: Sign (+1 or -1)
b: Base/radix of exponent representation
e: Exponent, where e_min <= e <= e_max
p: Precision
f_k: Nonnegative integer digits less than b
using Mathematica notation as described in http://documents.wolfram.com/mathematica/functions/Sum and where _ and ^ denote a subscript and a superscript, respectively.
This obvious symmetry of floating point values relative to the sign is also demanded *not* to be disturbed by subnormals, infinities and/or NaNs, which follows from further constraints described in 5.2.4.2.2/3.
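The model above can be observed directly with std::frexp, which splits a finite non-zero value into a fraction in [0.5, 1) and a power-of-two exponent, the same normalization as the sum over the digits f_k when b = 2. The sketch below assumes a binary floating-point type (checked with a static_assert); the struct Decomposed and the function decompose are hypothetical names for this example.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

static_assert(std::numeric_limits<double>::radix == 2,
              "this sketch assumes a binary floating-point double (b = 2)");

// Pieces of the model x = s * b^e * (f_1 b^-1 + ... + f_p b^-p) for b = 2.
struct Decomposed {
    int sign;         // s: +1 or -1
    double fraction;  // value of the digit sum, in [0.5, 1)
    int exponent;     // e
};

// Splits a finite, non-zero double into sign, fraction and exponent.
Decomposed decompose(double x) {
    assert(std::isfinite(x) && x != 0.0);
    int e = 0;
    const double m = std::frexp(x, &e);  // x == m * 2^e with 0.5 <= |m| < 1
    return Decomposed{m < 0 ? -1 : +1, std::fabs(m), e};
}
```

For example, 0.15625 decomposes into sign +1, fraction 0.625 and exponent -2. The fraction 0.625 is 0.101 in binary, so f_1 = 1, f_2 = 0 and f_3 = 1. Negating the input flips only the sign field, which is the symmetry the quoted post refers to. The precision p of the model is available as std::numeric_limits<double>::digits.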

Sequence processing will always be faster in performance

Consider some code that sums a square array:
[code]   for (row = 0; row < N; ++row)
      for (col = 0; col < N; ++col)
         sum += A[row][col];[/code] 
Or you can do it the other way round:
[code]   for (col = 0; col < N; ++col)
      for (row = 0; row < N; ++row)
         sum += A[row][col];[/code]
So does it matter? Indeed it does!
In C++ arrays are stored row-wise in contiguous memory. So if you traverse the array rows first you will traverse the data sequentially. That means that the next data point you need is likely to be in pipeline, cache, RAM, and the next hard drive sector before you need it. But if you traverse columns first then you will be repeatedly reading just one item from each row before reading from the next row. As a result your system's caching and lookahead will fail, and you will waste a lot of time waiting for RAM, which is about three to five times slower than cache. Even worse, you may wind up waiting for the hard drive, which can be millions of times slower than RAM if accessed in large random jumps.
How much time is wasted? On my machine summing a billion bytes row-wise takes 9 seconds, whereas summing them column-wise takes 51 seconds. The less sequential your data access, the slower your program will run.
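To measure the difference on your own machine, the two traversals can be wrapped in a small timing harness. The sketch below is an assumption-laden illustration rather than the benchmark behind the numbers above: it stores the N x N matrix row-major in one std::vector, and the names sum_row_major, sum_col_major and seconds_taken are made up here. Actual timings depend heavily on matrix size, cache sizes and compiler optimization.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Sums an n x n row-major matrix row by row (sequential memory access).
long long sum_row_major(const std::vector<int>& a, std::size_t n) {
    assert(a.size() == n * n);
    long long sum = 0;
    for (std::size_t row = 0; row < n; ++row)
        for (std::size_t col = 0; col < n; ++col)
            sum += a[row * n + col];
    return sum;
}

// Sums the same matrix column by column (stride-n memory access).
long long sum_col_major(const std::vector<int>& a, std::size_t n) {
    assert(a.size() == n * n);
    long long sum = 0;
    for (std::size_t col = 0; col < n; ++col)
        for (std::size_t row = 0; row < n; ++row)
            sum += a[row * n + col];
    return sum;
}

// Runs `work` once and returns the elapsed wall-clock time in seconds.
template <typename F>
double seconds_taken(F&& work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count();
}
```

A caller would build a large matrix, for example std::vector<int> a(n * n, 1) with n in the thousands, then time each function with seconds_taken and print both the sum and the elapsed time so that the compiler cannot discard the loops. Both functions return the same sum; only the memory access order differs.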