Sunday 16 June 2013

Fast ReadWrite Lock



Locks are bad for performance, but still they are good fit for some scenario.

In this post i will review java Read/Write lock and alternate implementation of fast Read/Write lock 

What is Readers-Writer lock
Lets do quick re-cap of read/write lock.

As name suggest it is shared lock , it is shared between readers/writers. So you can have either reader(s) or writer. You can have multiple readers or one writer active at any given point of time. So it is very good fit for the case where number of read operation is more than write.

What is the issue with Java RW lock
By design only reader(s) or writer can be active at any given point of time, so it will cause starvation when contention is high

Advance RW lock will have additional features like
  • Priority -   for eg writer has more priority than readers or otherwise
  • Fairness  - This can be based on waiting time
It is complex to build lock with these set of rules and many things are in control of OS, for eg scheduling of thread.
With all these issues rw locks are still used.

How does RW lock perform.
Lets try to measure rw lock performance under different scenario

Details of problem that will be used for testing

Read/Write to Hashmap, writer threads adds 1 Million entry to map and reader tries to read those value back, for write operation Write lock is used and for read operation Read lock is used.
I will try to test java rw lock under no-contention and gradually try to increase contention to see how it performs.

Legend to read graph

1W - One writer
1W-1R - One writer & One reader
1W-2R - One write & Two reader 

Writer Performance



X Axis - Writer/reader Threads
Y Axis - Time to 1 Milli entry









Case when there is "1 Writer & 1 Reader" shows wired number on my system, i will ignore that for comparison.

When there is no contention writer takes around 73 ms and as we add more reader threads contention increases, it starts to slow down, in my example when total thread count reaches to 4 performance drops by 10 times. That is big drop for "1Writer-3Reader" scenario, just think what will happen if more readers are added.

Writer Throughput
Lets have look at throughput



X Axis - Writer/reader Threads
Y Axis - Throughput/Sec








I will ignore case "1W-1R" 

Throughput also takes big hit under contention, it falls from sky to roof, for 1 writer throughput is 14 million and once total thread count reaches to 4 throughput drops to 1.4 million which is 90% less.

Writer is just one part of RW lock, lets look at  the performance of reader.

Reader Performance

X Axis - Writer/reader Threads
Y Axis - Time to add 1 Million entry








Performance degrade trend continues in readers also.

Reader Throughput

Reader throughput also takes hit under contention.








Readers without writer contention
This scenario is very realistic because 

  • Mostly writer thread will populate data at application start up time
  • Or based on some write event to update data incrementally

90% + times application only reads the data and it is the performance of readers threads that will matter most by the end of day. Lets have look at performance when only readers are active.




  
In this test data in hashmap was pre-populated and only readers threads were executing.
Numbers now looks real :-) . 

It takes more time to read data as number of readers increase, same thing is observed for throughput also, it drops.

Alternate Reader/Writer Lock
Lets compare alternate implementation of RW lock

Readers without writer contention

Fast Lock outperforms java ReadWriteLock in both the test. 

In the first test, time taken to read is almost flat with Fast RW Lock and for java based it is going up like stock price as we keep on adding more readers thread. 

In throughput test java rw lock starts from 28 million and it drops to 1.6 million, that is 95% drop.
Fast rw lock starts from 58 million and drops to 34 million, that is around 40% drop.

Under heavy contention(i.e 4 readers), java readerwriter lock is around 20X times slow, that looks like comparing apple vs orange, but test result shows that.

So for readers new lock looks great, lest measure with writer.

 Writer Performance - Java RW vs Fast RW

Fastlock performs better than java RW lock under contentions 

Reader Performance - Java RW vs Fast RW



What makes FastRW lock fast
Now lets understand why new read/write lock is fast.

RW lock that i have used for testing is an implementation of SeqLock. SeqLock is very common in linux world, it provide parallelism for readers & writer. Reader never blocks writer & reader uses optimistic approach to read the data.

Readers in seqlock never update any sate in ReadWrite lock and thus reduces cpu cache traffic,so just load operation is involved for reading data.

So in nutshell it provides 

  • Less cpu cache traffic  
  • Readers never block writer
  • Reader never changes state of lock, no expensive store operation
  • Readers can slow down if write frequency is very high
  • Writer always get priority 


With so much of benefit i am sure you will never use java RW lock!

Conclusion 
SeqLock are very good alternative for reader/writer lock and especially in multicore world where we are use new technique to get best out of processor. Someday java will add support for seqlock.
Code is available @ GitHub for your reference.

Sunday 9 June 2013

CPU cache access pattern

Things to know about low latency

If you are developing low latency application then you have to remove bottleneck that can cause your latency to increase, some of the bottleneck that comes to my mind

 - Input/Output - I/O has big effect in latency it will cause CPU to stall. SSD disk may be one option, but still you want to keep this out of core processing logic.

 - Network - Same effect as I/O, so you have to reduce Network read/write in your core logic, there are many software and hardware solution to improve this.

- Locks in code - Using non-blocking techniques.

- Keeping data in RAM - I will explore this option in blog.

So we can keep all data in RAM to get best latency possible. RAM is very cheap, so lets put every thing in RAM. In theory you should be able to keep CPU busy if all the data that you need for computing is available in RAM, but in practical it doesn't happen that way because of underlying datastructure used.
CPU stalls for some time due to slow access of RAM.

Experiment
Lets try simple experiment to prove that keeping all data in RAM is just not enough to achieve low latency.

Problem:
25 million instrument object are stored in memory and each object has 2 attribute
 - Instrument Id
 - Price 
Sets of input price is passed as input and all the instrument matching that price will be returned. It is a simple search based on price attribute.

Structure Of Instrument Object
Two types of instrument object is created for this test.

 - As simple java object
Instrument object is simple java class with 2 attribute

 class Instrument
{
public int id;
private int price;
public Instrument(int id, int price) {
super();
this.id = id;
this.price = price;
}

public int getId() {
return id;
}

public int getPrice() {
return price;
}
}

- Column type object
For each attribute array is created, this structure is more CPU cache friendly, we will why is is so later.
Sample ColumnInstrument Object

 class ColumnInstrumentStore
{

private int[] ids;
private int[] prices;

public ArrayInstrumentStore(int size)
{
ids = new int[size];
prices = new int[size];
}


public int getId(int index) {
return ids[index];
}


public int getPrice(int index) {
return prices[index];
}
public void setValue(int index, int id, int price) {
ids[index] = id;
prices[index] = price;
}

}

Lets measure performance
Lets try to measure search cost for 2 types of instrument object,

 X axis - No of reading
 Y axsis - Time taken in Ms

Linear search is performed on 25 million instrument object.
Instrument object defined as column is around 5% to 25% faster as compared to normal java based instrument object, on average it is 15% faster as compared to java based object.

Lets look at another number - Time take for each iteration. Time taken for 1 iteration is very less so i have used nano second to measure that.



 X axis - No of reading
 Y axis - Time taken in NanoSecond

This explains why search on column based objects is fast, time spent per iteration is around 15% less .

Why Column based object is fast
Lets look at the reason why search performs better on Column layout type of object as compared to normal object.

 - CPU Pre-Fetching - Today almost all the processor supports hardware pre-fetch. CPU pre-fetch data that is required by application to reduce memory latency

 - Better use of CPU cache - CPU moves data from main memory to cache via cache lines, in current generation of CPU, cache line is of 64 byte, so if you ask for 1 byte of data, processor will still bring 64 byte of data from ram to CPU cache line. This is called - Spatial locality, definition from wikipedia -

Spatial locality: if a particular memory location is referenced at a particular time, then it is likely that nearby memory locations will be referenced in the near future. In this case it is common to attempt to guess the size and shape of the area around the current reference for which it is worthwhile to prepare faster access

This CPU behavior can be used to reduce round trip to ram by co-locating you data.

Lets take Instrument object to understand layout

- Object Layout


- Column Layout


In case of column based instrument, data is stored in integer array, if program request for single value of array , it will get next 15 element of array due to cache line transfer, but in case of normal object it will get only 7 instrument price.
Normal object layout requires double round trip to memory and that is the reason why it is slow.
If you co locate your objects then you can use CPU cache much more efficiently.

Most of the real object are more complex than Instrument object that i have used and you can imagine number of round trip it require to RAM.
You can get more details of memory latency from latency-number-that-you-should-know
Just to put those number in context -

L1 Cache : 0.5 ns
Main Memory  : 100 ns ( 20x times more than L1)


How it performs in multi-threaded world.
In today's time it is impossible to get machine with single core, we live in multi core world!
Each logical core has its own private cache, so if you data structure is cache friendly then it will scale linearly as you add more threads/core to your processing.

Lets look at some number for multi threaded search.

Search latency


Loop Iteration

Both search & iteration per second is improving as more number of threads are added, this will become more effective for NUMA machines because memory access is not uniform.

Just by having CPU cache friendly layout you can improve application performance
Column layout have more benefit, more on next blog.