Wednesday, November 23, 2011

FastFlow - the new multithreading framework with Atomic Queues

In high-frequency trading systems, lock-free queues can increase the speed of message transfer between threads while also reducing performance jitter.

When we talk about so-called lock-free mechanisms, an atomic queue should be the first choice to consider.
Note, however, that an atomic operation is not free of synchronization; otherwise there would be no data safety between threads. The synchronization is simply done by the CPU hardware instead of by a software lock: the processor guarantees that a read or write of an aligned, word-sized memory location (typically the size of an integer or pointer) completes as a single indivisible step.
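
As a rough illustration of the idea (this is not FastFlow's implementation), a queue shared by exactly one producer thread and one consumer thread can be built from two atomic indices. The class name `SpscQueue`, the power-of-two capacity and the 64-byte cache-line alignment are assumptions made for this sketch.

```cpp
#include <atomic>
#include <cstddef>

// Hypothetical single-producer/single-consumer ring buffer (illustration only).
// Exactly one thread may call push() and exactly one thread may call pop().
template <typename T, std::size_t Capacity>
class SpscQueue {
    static_assert(Capacity >= 2 && (Capacity & (Capacity - 1)) == 0,
                  "Capacity must be a power of two");

public:
    // Producer side. Returns false when the queue is full.
    bool push(const T& value) {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        const std::size_t tail = tail_.load(std::memory_order_acquire);
        if (head - tail == Capacity) {
            return false;  // no free slot
        }
        buffer_[head & (Capacity - 1)] = value;
        // Release: the consumer sees the slot contents before the new index.
        head_.store(head + 1, std::memory_order_release);
        return true;
    }

    // Consumer side. Returns false when the queue is empty.
    bool pop(T& out) {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        const std::size_t head = head_.load(std::memory_order_acquire);
        if (head == tail) {
            return false;  // nothing to read
        }
        out = buffer_[tail & (Capacity - 1)];
        // Release: the producer may reuse the slot only after it was read.
        tail_.store(tail + 1, std::memory_order_release);
        return true;
    }

private:
    T buffer_[Capacity];
    // Separate cache lines (64 bytes assumed) to avoid false sharing.
    alignas(64) std::atomic<std::size_t> head_{0};  // next slot to write
    alignas(64) std::atomic<std::size_t> tail_{0};  // next slot to read
};
```

No mutex is taken anywhere; the only synchronization is the atomic load and store of the two word-sized indices, which is the hardware guarantee described above. A queue with several producers or consumers needs a different design.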

There are many sample atomic queue implementations on the web. We found FastFlow to be the best of them: it provides atomic queues directly, together with a multithreading framework that helps to enforce a proper software architecture.

Better still, FastFlow is a free, open-source project that is actively maintained and tested by a group of users.
FastFlow architecture design

FastFlow Features


For more details, see its SourceForge page: https://sourceforge.net/projects/mc-fastflow/

Thursday, November 17, 2011

Performance of Memory Copy between Host and Device with NVIDIA cards

Performance of NVIDIA GeForce 8600
  • cudaMemcpyHostToDevice = 18 microseconds
  • cudaMemcpyDeviceToHost = 23 microseconds

Performance of NVIDIA GTX 580
  • cudaMemcpyHostToDevice = 6 microseconds
  • cudaMemcpyDeviceToHost = 8 microseconds

As we can see, a newer NVIDIA GPGPU card such as the GTX 580, with the Fermi memory architecture, copies data between host and device much faster than an older NVIDIA card such as the GeForce 8600.

Monday, November 14, 2011

NVIDIA GPU Fermi Memory Hierarchy Review

NVIDIA GPU Fermi Memory Hierarchy

Local storage

  • Each thread has its own local storage
  • Mostly registers (managed by the compiler)

Shared memory / L1

  • Program configurable: 16KB shared / 48 KB L1 OR 48KB shared / 16KB L1
  • Shared memory is accessible by the threads in the same threadblock
  • Very low latency
  • Very high throughput: 1+ TB/s aggregate

L2

  • All accesses to global memory go through L2, including copies to/from CPU host

Global memory
  • Accessible by all threads as well as host (CPU)
  • Higher latency (400-800 cycles)
  • Throughput: up to 177 GB/s


If shared memory / L1 is used properly, the speedup can be much greater.

High-frequency trading: Good, bad or just different? - Technology - Futures Magazine

High-frequency trading: Good, bad or just different? - Technology - Futures Magazine:
Mike O’Hara has interviewed scores of traders, connectivity providers, academics and exchange operators for his web site, the High Frequency Trading Review. He always opens his interviews with the same question: "What is high-frequency trading (HFT)?"
He never gets the same answer twice.
"The problem is that ‘high-frequency’ is a relative term," says O’Hara, a former floor trader at the London International Financial Futures Exchange (Liffe). "There are, however, some common threads in all definitions: It’s computer-driven; it generates a large number of orders in a short space of time; it’s dependent on low-latency, fast access to execution venues; its positions are held for short periods of time; it ends the day flat and it’s characterized by a high order-to-trade ratio." ...

Deep CUDA Training, Basics / Advanced

NVIDIA and NOVATTE offer their customers free CUDA training in Singapore, which is a really smart marketing strategy. The training provides clarifications and optimization tricks, and is quite informative.

Below is the certificate I received.
Deep CUDA Training, Basics / Advanced

Friday, November 11, 2011

Nerdblog.com: How slow is dynamic_cast?

Nerdblog.com: How slow is dynamic_cast?: C++ users are advised frequently not to use dynamic_cast, because it's slow. I thought it would be nice to measure this, so I made up a tes...

How do you compare FPGA and GPU?

We used the GPU in our option pricing module. Its calculation speed is only about 20% of the CPU's; however, its parallel computing power helps us to reduce overall latency when the underlying price changes.

The latency cost of memory copies between GPU and CPU is a bit high, about 30 to 50 microseconds.

I feel that an FPGA would most likely achieve the same performance if we used it the same way as the GPU. The FPGA might, however, be more efficient in power consumption.

An FPGA should be able to achieve better performance if it bypasses unnecessary OS communication and performs parallel calculation in physical logic blocks, which a CPU cannot do.

10GBPS NIC and Optical Fiber for High Frequency Trading


The 10 gigabit Ethernet (10GE or 10GbE or 10 GigE) computer networking standard was first published in 2002. It defines a version of Ethernet with a nominal data rate of 10 Gbit/s (billion bits per second), ten times faster than gigabit Ethernet.
10 gigabit Ethernet defines only full duplex point to point links which are generally connected by network switches. Half duplex operation, hubs and CSMA/CD (carrier sense multiple access with collision detection) do not exist in 10GbE.
The 10 gigabit Ethernet standard encompasses a number of different physical layer (PHY) standards. A networking device may support different PHY types through pluggable PHY modules, such as those based on SFP+. Over time market forces will determine the most popular 10GE PHY types.[1]
More can be read at http://en.wikipedia.org/wiki/10_Gigabit_Ethernet
10GBPS NIC cards

For high-frequency traders, network cards need to be carefully selected as well. A 10 Gbps NIC can effectively outperform 1 Gbps cards. Keep in mind, however, that a 10 Gbps NIC needs a 10 Gbps switch and optical fiber to form a complete solution; otherwise the speed gain will be limited by whichever of the three is the bottleneck.
10GBPS NIC Adaptors

Dell and other OEM PC/server manufacturers currently provide offload features on their network cards to save the CPU power spent on socket data processing. 10 Gbps Ethernet cards provide further offload capabilities, with built-in FPGA chips that bypass the CPU and the TCP/IP stack as much as possible. For example, Solarflare provides OpenOnload drivers to boost performance, and its products have been deployed at many exchanges and HFT firms.

As a typical performance figure, network latency with 1 Gbps cards is around 50 to 70 microseconds, while a 10 Gbps NIC from Solarflare can reach around 5 microseconds.

Thursday, November 10, 2011

Recommend: book The Intelligent Investor by Benjamin Graham

The Intelligent Investor by Benjamin Graham

Before we continue with other aspects of high-frequency trading, I would like to share one of the investment books I have read, The Intelligent Investor by Benjamin Graham, which explains how to follow a simple principle of value investing in stock markets. In summary, it is all about patience, discipline and consistency; with those, you can at least beat the market.

As a developer for high-frequency traders, I know that we should fear them and avoid competing against them. I strongly suggest that if you trade in stock markets personally, you stop short-term trading unless all of the following conditions are met:

  • your algorithm system does not compete directly with other professional traders
  • your algorithm does not depend on the performance of your system
  • your algorithm works with an amount of cash small enough that it will not impact market movements


I myself trade on the Singapore and China stock exchanges, following the direction of The Intelligent Investor. Luckily, neither exchange has sophisticated high-frequency trading, so I can still run some simple spread strategies manually on top of value investing. I also combine these simple strategies with value investing because, as a full-time software engineer, I do not have the luxury of time to follow the market closely, and I do not like the idea of staring at market data daily. We need to spend time enjoying life, right? Money is important, but it is not everything.

Monday, November 7, 2011

User authority for operating system real-time capabilities

Latency-sensitive applications will try to pin all their memory and use real-time priorities
(through chrt or sched_setscheduler()) for application threads.

To enable this capability, create a group with these special privileges and add the required users to this
group. Typical values for this privileged group can be set in a configuration file as follows:

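# Create the privileged group (assumes it does not exist yet)
groupadd realtime

TIMEOUT=unlimited    # CPU time limit, in minutes
RT_PRIORITY=100      # maximum real-time priority
NICE=-20             # highest nice priority the group may request
MEMLOCK=unlimited    # locked-memory limit, in KB
cat >>/etc/security/limits.conf <<EOF
@realtime soft cpu $TIMEOUT
@realtime - rtprio $RT_PRIORITY
@realtime - nice $NICE
@realtime - memlock $MEMLOCK
EOF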

Users can then be added to the realtime group with usermod (-a -G appends the group as a supplementary group without changing the user's primary group):
usermod -a -G realtime <userid>

Users must sign out and then sign back into the system for these changes to take effect.

For more information about the limits.conf file, see the limits.conf man page.
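
Once the limits are in place, an application can request the real-time scheduler and pin its memory from code. The following is a minimal sketch for Linux; the helper names `clamp_fifo_priority`, `enable_realtime` and `pin_memory` are made up for illustration. Note that valid SCHED_FIFO priorities on Linux run from 1 to 99, so the sketch clamps the requested value to the range the system reports.

```cpp
#include <algorithm>
#include <sched.h>
#include <sys/mman.h>

// Clamp a requested priority into the range the system reports as valid
// for SCHED_FIFO (1..99 on Linux).
int clamp_fifo_priority(int requested) {
    const int lo = sched_get_priority_min(SCHED_FIFO);
    const int hi = sched_get_priority_max(SCHED_FIFO);
    return std::max(lo, std::min(requested, hi));
}

// Switch the calling thread to SCHED_FIFO at the given priority.
// Returns false if the call is rejected, for example because the
// rtprio limit of the user is too low.
bool enable_realtime(int requested_priority) {
    sched_param param{};
    param.sched_priority = clamp_fifo_priority(requested_priority);
    return sched_setscheduler(0, SCHED_FIFO, &param) == 0;
}

// Lock all current and future pages of the process into RAM so that
// they are never swapped out. Returns false if the memlock limit of
// the user does not allow it.
bool pin_memory() {
    return mlockall(MCL_CURRENT | MCL_FUTURE) == 0;
}
```

A latency-sensitive process would typically call `pin_memory()` and `enable_realtime(...)` once at startup and log a warning when either returns false, since that usually means the limits above have not taken effect for the current login session.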

Sunday, November 6, 2011

Sandy Bridge and Overclocking

Some high-frequency traders use the latest Sandy Bridge processors together with overclocking to increase CPU power. With these techniques, the CPU clock frequency can be raised to 4.5 GHz. One thing to note: unless all of your processes or threads are busy-waiting, do not enable CPU power saving in your BIOS settings; otherwise thread switching will be slower when sleeping threads need to wake up.

Sandy Bridge and Ivy Bridge each increase CPU power by a further 10% to 20% compared with the previous CPU generation.
Intel Sandy Bridge

Sandy Bridge
Up to 17% more CPU performance clock-for-clock compared to Lynnfield processors.[20]
Around twice the integrated graphics performance compared to Clarkdale's (12 EUs comparison).

Ivy Bridge
Intel's performance targets (compared to Sandy Bridge):[21]
20% increase in CPU performance.
Up to 60% increase in integrated graphics performance.[22]

CPU Specification Comparison

Generation    | Socket   | Cores      | Transistor count | Die size
Sandy Bridge  | LGA 1155 | 4          | 995 million[23]  | 216 mm²
Sandy Bridge  | LGA 1155 | 2 (6 EUs)  | 504 million      | 131 mm²
Sandy Bridge  | LGA 1155 | 2 (12 EUs) | 624 million      | 149 mm²
Sandy Bridge  | LGA 2011 | 4/6/8      |                  |
Ivy Bridge    | LGA 1155 | 4          | 1.4 billion[24]  | ~172 mm²[25]
Ivy Bridge    | LGA 2011 |            |                  |


Overclocking Cooling

More can be read at http://en.wikipedia.org/wiki/Sandy_Bridge for Sandy Bridge and http://en.wikipedia.org/wiki/Overclocking for overclocking.

Thursday, November 3, 2011

What is really needed by a high frequency trader

High-Frequency Trading

What Does High-Frequency Trading - HFT Mean?
A program trading platform that uses powerful computers to transact a large number of orders at very fast speeds. High-frequency trading uses complex algorithms to analyze multiple markets and execute orders based on market conditions. Typically, the traders with the fastest execution speeds will be more profitable than traders with slower execution speeds. As of 2009, it is estimated more than 50% of exchange volume comes from high-frequency trading orders.

Note the sentence "Typically, the traders with the fastest execution speeds will be more profitable than traders with slower execution speeds." High-frequency traders normally do not care about your system's absolute latency figure; they just want to ensure that your system is faster than their competitors'.
In other words, your system must take less time than theirs from receiving a market tick to sending the order to the exchange. Remember: the key is not how fast your system can be, but that it is faster than the others.
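
The tick-to-order interval described above can be measured directly in the trading process. The following is a minimal sketch using the monotonic clock of the C++ standard library; the struct `TickToOrderTimer` and its member names are hypothetical and not taken from any particular trading system.

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical helper: measures the time from "market tick received"
// to "order handed to the exchange gateway".
struct TickToOrderTimer {
    using Clock = std::chrono::steady_clock;  // monotonic, never jumps back

    Clock::time_point tick_received{};

    // Call as soon as the market data tick arrives.
    void on_tick() { tick_received = Clock::now(); }

    // Call right after the order is sent; returns elapsed nanoseconds.
    std::int64_t on_order_sent_ns() const {
        const auto elapsed = Clock::now() - tick_received;
        return std::chrono::duration_cast<std::chrono::nanoseconds>(elapsed)
            .count();
    }
};
```

This only captures the time spent inside the process. Network and exchange gateway latency, which competitors also face, has to be measured separately, for example with timestamps taken on the wire.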

For similar strategies that exploit arbitrage opportunities, the institution with the faster system will grab all the market opportunities and leave nothing for its competitors. It is not a game in which all market participants share the opportunities. According to one statistical report, the number of high-frequency trading firms in India fell from more than 300 to about 30 in just one year. Survival depends on the speed of the system they use.

Once the latency war starts, it ends with all surviving systems running at almost the same speed, limited by the technology currently available. For high-frequency trading, the exchange's support is equally important. Some exchanges, such as those in China, actually limit the scope for high-frequency trading strategies by placing restrictions in their gateway libraries.