Friday, 14 March 2014

Off Heap concurrent counter

Concurrent counters are part of almost every system; they are used to collect data, coordinate threads etc.
Java has good support for heap based counters.

There are use cases where you need a counter that can be shared between processes.

How to build inter process counters 
 - Database 
This is the first option that comes to mind: a database sequence is a counter that can be used by multiple processes, with all concurrency handled by the database. It is a good option for starters, but we know the types of overhead (network, locks etc.) you get with a database. Only Larry Ellison will be happy about it, not you!
 - Some Server
You could develop some server/middleware that provides this type of service. This option still has network latency and marshal/unmarshal overhead.

 - Memory Mapped file
You could use a memory mapped file to do this. I got the idea from the thread-safe-interprocess-shared-memory-in-java presentation by Peter Lawrey.

Challenges involved in a multi process counter
- Data visibility 
    Changes done by one process should be visible to all processes. This problem is solved by the memory mapped file: the operating system guarantees visibility of the changes, and the Java memory model makes it possible to rely on that guarantee.

- Thread safety 
A counter is all about multiple writers, so thread safety becomes the big problem. Compare-and-swap (CAS) is one option for handling multiple writers.
Is it possible to use CAS for an off heap operation? Yes it is: welcome to Unsafe.
By combining a memory mapped file with Unsafe, it is possible to perform CAS on off heap memory.

In this blog I will share my experiment with CAS on a memory mapped file.

How ?
- How to get memory address
MappedByteBuffer is backed by a DirectByteBuffer, which lives off heap. So it is possible to get the virtual address of the mapped memory and use Unsafe to perform CAS on it. Let's look at the code.
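The original snippet is not preserved here, so below is a minimal sketch of what the mapping and CAS code might look like. It is JDK 7/8-era code relying on sun.misc.Unsafe (still reachable via reflection on later JDKs); the class name and file name are illustrative, not from the original.

```java
import java.io.RandomAccessFile;
import java.lang.reflect.Field;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import sun.misc.Unsafe;

public class OffHeapCounter {
    private static final Unsafe UNSAFE = loadUnsafe();
    private final long address; // virtual address of the mapped region

    public OffHeapCounter(String fileName) throws Exception {
        RandomAccessFile file = new RandomAccessFile(fileName, "rw");
        // Map 8 bytes - enough for one long counter shared across processes.
        MappedByteBuffer buffer = file.getChannel()
                .map(FileChannel.MapMode.READ_WRITE, 0, 8);
        // MappedByteBuffer is a DirectByteBuffer; its native address is kept
        // in the private java.nio.Buffer.address field, which we read via
        // Unsafe to avoid setAccessible on a java.base class.
        Field addressField = java.nio.Buffer.class.getDeclaredField("address");
        long offset = UNSAFE.objectFieldOffset(addressField);
        address = UNSAFE.getLong(buffer, offset);
    }

    // Volatile read straight from the mapped memory.
    public long readVolatile() {
        return UNSAFE.getLongVolatile(null, address);
    }

    // Classic CAS retry loop on the off heap address.
    public long increment() {
        long current;
        do {
            current = UNSAFE.getLongVolatile(null, address);
        } while (!UNSAFE.compareAndSwapLong(null, address, current, current + 1));
        return current + 1;
    }

    private static Unsafe loadUnsafe() {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            return (Unsafe) f.get(null);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```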

The code above creates a memory mapped file of 8 bytes and gets its virtual address. This address can be used to read/write the content of the memory mapped file.

- How to write/read in a thread safe manner

The important functions to look at are readVolatile and increment.
readVolatile reads directly from memory, and increment uses Unsafe to perform CAS on the address obtained from the MappedByteBuffer.

Some performance numbers from my system. Each thread increments the counter 1 million times.

Performance of the counter is decent; as the number of threads increases, CAS failures start to happen and performance starts to degrade.
Performance of this counter can be improved by using multiple segments to reduce write contention.
I will write about that in the next blog.

 - Memory mapped files are very powerful; they can be used to develop lots of things like off heap collections, IPC, off heap thread coordination etc.
 - Memory mapped files open the gates for GC-less programming.

All the code used in this blog is available on github.

Monday, 17 February 2014

AtomicInteger Java 7 vs Java 8

AtomicInteger is an interesting class; it is used for building many lock free algorithms. In fact, JDK locks are also built using ideas from the atomic datatypes.

As the name suggests, it is used for doing atomic increment/decrement, so you don't have to use locks; it uses processor level instructions to do so.
It is based on the compare-and-swap (CAS) instruction.

Issue with CAS
CAS takes an optimistic approach: it admits failure and retries the operation, so in theory, if there is no contention, it should work pretty fast.

There is an alternative way of doing the same thing using fetch-and-add.
Fetch-and-add is very different from CAS; it is not based on retry loops.

Dave Dice compares CAS vs fetch-and-add in his atomic_fetch_and_add_vs blog. I can't explain it better than he does, so I will quote from his blog:

  1. CAS is "optimistic" and admits failure, whereas XADD does not. With XADD there's no explicit window of vulnerability to remote interference, and thus no need for a retry loop. Arguably, XADD has better progress properties, assuming the underlying XADD implementation doesn't have an implicit loop, but even in that case the window would be narrower than with Load;Φ;CAS.
  2. Lets say you were trying to increment a variable with the usual Load;INC;CAS loop. When the CAS starts failing with sufficient frequency you can find that the branch to exit the loop (normally taken under no or light contention) starts to predict toward the failure path. So when the CAS ultimately succeeds, you'll incur a branch mispredict, which can be quite painful on processors with deep pipelines and lots of out-of-order speculative machinery. Typically, this is in a piece of code where you don't want a long stall. There's no loop and no such issues with XADD.
Since fetch-and-add has predictable progress properties, it is used for developing wait free algorithms.
Unfortunately JDK 7 does not have support for fetch-and-add; one more reason for C++ people to be happy that C++ is great!

As we all know, things do change, and the Java community decided to add support for fetch-and-add in JDK 8: one more good reason to migrate to JDK 8.

In this blog I will compare the performance of AtomicInteger from JDK 7 and JDK 8.

Atomic Integer - JDK 7 vs JDK 8

In this test I increment a counter 10 million times with different numbers of threads. The thread count is increased to see how the counter performs under contention.
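The harness is roughly like the sketch below (class and method names are mine, not from the original AtomicCounterTest): the increments are split across the worker threads and the elapsed wall time is reported as MOPS.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class AtomicCounterBench {
    static final int TOTAL_INCREMENTS = 10_000_000;

    // Runs `threads` workers hammering one AtomicInteger and
    // returns the elapsed wall time in nanoseconds.
    static long run(int threads) throws InterruptedException {
        AtomicInteger counter = new AtomicInteger();
        int perThread = TOTAL_INCREMENTS / threads;
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) {
                    counter.getAndIncrement(); // CAS loop on JDK 7, XADD on JDK 8
                }
            });
        }
        long start = System.nanoTime();
        for (Thread worker : workers) worker.start();
        for (Thread worker : workers) worker.join();
        long elapsed = System.nanoTime() - start;
        // ops per microsecond == millions of ops per second (MOPS)
        System.out.printf("%2d threads : %.1f MOPS%n",
                threads, counter.get() / (elapsed / 1000.0));
        return elapsed;
    }

    public static void main(String[] args) throws InterruptedException {
        for (int threads : new int[]{1, 2, 4, 8}) {
            run(threads);
        }
    }
}
```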

X Axis - No of Threads
Y Axis - Ops/Second - Higher is better

The JDK 8 counter is the winner in this case. The best performance comes when there is no contention: for JDK 7 it is around 80 MOPS, but for JDK 8 it is close to 130 MOPS.
For a single thread the difference is moderate, JDK 8 being around 60% faster, but as contention increases the JDK 7 counter's performance starts falling.

I will put up another graph with the 1-thread number removed, so that we can clearly see how these counters perform.

This gives a better idea of how slow the JDK 7 AtomicInteger is: for 8 threads the JDK 8 counter is around 3.5X faster.

Dive Into Code

JDK 8 - AtomicInteger
public final int getAndIncrement() {
    return unsafe.getAndAddInt(this, valueOffset, 1);
}

JDK7 - AtomicInteger
public final int getAndIncrement() {
    for (;;) {
        int current = get();
        int next = current + 1;
        if (compareAndSet(current, next))
            return current;
    }
}

JDK 8 uses a new Unsafe function (getAndAddInt) to do the magic. Unsafe has become even more useful!

Dive into Assembly
Just to confirm that all the performance gain comes from fetch-and-add, I had a look at the generated assembly.

JDK 8 
0x0000000002cf49c7: mov    %rbp,0x10(%rsp)
  0x0000000002cf49cc: mov    $0x1,%eax
  0x0000000002cf49d1: lock xadd %eax,0xc(%rdx)  ;*invokevirtual getAndAddInt
                                                ; - java.util.concurrent.atomic.AtomicInteger::incrementAndGet@8 (line 186)


JDK 7
0x0000000002c207f5: lock cmpxchg %r8d,0xc(%rdx)
  0x0000000002c207fb: sete   %r11b
  0x0000000002c207ff: movzbl %r11b,%r11d        ;*invokevirtual compareAndSwapInt
                                                ; - java.util.concurrent.atomic.AtomicInteger::compareAndSet@9 (line 135)
                                                ; - java.util.concurrent.atomic.AtomicInteger::incrementAndGet@12 (line 206)

The introduction of fetch-and-add type features in Java will make it more suitable for high performance computing; we will see more wait free algorithms in Java.

The code used for testing is available @ AtomicCounterTest.
Just compile it for JDK 7/8 and execute it.

Thursday, 16 January 2014

Java Queues - Bad Practices

Writing after a long gap; looks like I lost the motivation or ran out of topics :-)

Recently, while going through the code of my current assignment, I noticed that Java's blocking/concurrent queues were used in the worst possible ways, and that motivated me to list down some bad practices for accessing queues.

 - Ignoring the return value of the "offer" function
Blocking queues have various functions that can be used to insert elements (add/offer/put), and offer is the special one.
Details from the Java doc about offer:

Inserts the specified element into this queue if it is possible to do so immediately without violating capacity restrictions, returning true upon success and false if no space is currently available. When using a capacity-restricted queue, this method is generally preferable to add(E), which can fail to insert an element only by throwing an exception.

The return value of offer is very important. In many cases the returned value is ignored, and then it becomes a mystery why some messages were never added, making the application magical: sometimes it works.

The offer function is really useful & the Java Executors framework uses it for a very good reason: rejecting tasks when the task submission rate is higher than the task processing rate, which it does by checking the return value of the queue's offer call when a task is submitted.

So when you use the offer function, never ignore the return value. If you just want an exception to be thrown whenever the queue is full, use the "add" function; it has built-in support for throwing an exception on capacity violation.
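A minimal illustration of the difference (queue capacity and values are arbitrary):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class OfferVsAdd {
    public static void main(String[] args) {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(2);

        // offer reports rejection through its return value -
        // ignoring it means silently losing messages.
        System.out.println(queue.offer("a")); // true
        System.out.println(queue.offer("b")); // true
        System.out.println(queue.offer("c")); // false: queue is full

        // add throws instead, which is what you want when a full
        // queue should be treated as an error.
        try {
            queue.add("c");
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```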

- Using isEmpty/peek in a spin loop
Blocking queues have a lock based waiting strategy; it is used for the blocking operations put/take. But they also have some functions that are not blocking, and if those are used in a loop they will cause heavy contention.

I found an example where the consumer was "highly optimized": it used isEmpty in a loop to check whether an element was available, and when there was one it called take to retrieve it.

I don't know what the developer was thinking; maybe he was in very deep thought when such code was written.
Such consumer code effectively deadlocked the system: the producer could not add elements because it was unable to get the lock, which was held most of the time by the isEmpty calls.

For such code the thread dump is also interesting: you will see the producer blocked but no trace of the consumer thread. Multiple thread dumps are required to confirm this.

A blocking queue should be used the way it is designed to be; if you use non blocking calls, then you must have some backoff strategy to avoid starving other threads of the lock.

- Single consumer calling take in a loop

This is a very common pattern: messages are taken from the queue one by one. Since a blocking queue is lock based, it needs the lock for each operation.
It is always better to get multiple items from the queue in one call using a drainTo type function; this reduces contention and gives a significant performance gain.
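A sketch of a batching consumer (the helper name is mine):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class BatchingConsumer {
    // Blocks for the first element, then drains everything else that
    // is already queued in a single locked drainTo call, instead of
    // paying one lock acquisition per element.
    static <T> List<T> takeBatch(BlockingQueue<T> queue, int maxBatch)
            throws InterruptedException {
        List<T> batch = new ArrayList<>(maxBatch);
        batch.add(queue.take());
        queue.drainTo(batch, maxBatch - 1);
        return batch;
    }

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<Integer> queue = new ArrayBlockingQueue<>(16);
        for (int i = 0; i < 5; i++) {
            queue.put(i);
        }
        System.out.println(takeBatch(queue, 10)); // [0, 1, 2, 3, 4]
    }
}
```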

- Using the size function for heartbeats
This is also a very common case: size is used to print a heartbeat message. Size is not a cheap operation for a blocking queue; it needs the lock to compute the size.
An atomic variable can be used to keep track of the queue size instead; reads of the atomic variable have no effect on queue performance.
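One way to do this is a thin wrapper that counts puts and takes (a sketch; names are illustrative, and the size is only approximate because the counter update is not atomic with the queue operation):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Wraps a blocking queue and tracks its size in an AtomicInteger,
// so heartbeat/monitoring code can read the size without touching
// the queue's internal lock.
public class MonitoredQueue<T> {
    private final BlockingQueue<T> queue = new ArrayBlockingQueue<>(1024);
    private final AtomicInteger size = new AtomicInteger();

    public void put(T item) throws InterruptedException {
        queue.put(item);
        size.incrementAndGet();
    }

    public T take() throws InterruptedException {
        T item = queue.take();
        size.decrementAndGet();
        return item;
    }

    // Cheap read for heartbeat logging - never contends with
    // producers and consumers for the queue lock.
    public int approximateSize() {
        return size.get();
    }
}
```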

- Using a sleep waiting strategy with ConcurrentLinkedQueue
Code using a sleep waiting strategy:

I am sure you will have a very good reason to use CLQ, but sleep spoils everything: because CLQ doesn't have blocking support, many applications use sleep to avoid a spin loop.

A sleep based approach is fine in some scenarios, but if you really want to build an event based system then sleep doesn't help; you need some kind of notification mechanism, so that whenever an element is added to the queue the consumer is notified.
It is pretty easy to add such support on top of CLQ. I created a blocking concurrent queue to try a few things; the executor-with-concurrentlinkedqueue blog has some details about it.
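The real implementation is in the linked blog; below is a simplified single-consumer sketch of the idea using LockSupport, so the producer stays lock free and only wakes the consumer when it is actually waiting (class name and structure are mine):

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicReference;
import java.util.concurrent.locks.LockSupport;

// Single-consumer blocking facade over ConcurrentLinkedQueue.
// Producers stay lock free; they only unpark the consumer when it
// registered itself as waiting.
public class BlockingConcurrentQueue<T> {
    private final Queue<T> queue = new ConcurrentLinkedQueue<>();
    private final AtomicReference<Thread> waiter = new AtomicReference<>();

    public void offer(T item) {
        queue.offer(item);
        Thread consumer = waiter.getAndSet(null);
        if (consumer != null) {
            LockSupport.unpark(consumer); // wake the parked consumer
        }
    }

    // Must only be called from one consumer thread.
    public T take() {
        T item;
        while ((item = queue.poll()) == null) {
            waiter.set(Thread.currentThread());
            if (queue.peek() == null) { // re-check to avoid a lost wakeup
                LockSupport.park();
            }
            waiter.set(null);
        }
        return item;
    }
}
```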

Each type of queue is designed for a very specific use case and should be used in the proper way.
Some of the bad practices mentioned in this blog make an application slow or unresponsive; such usage patterns should be avoided.

Tuesday, 15 October 2013

Executor With ConcurrentLinkedQueue

In this blog I will explore some waiting strategies for inter thread communication.

BlockingQueue is an integral part of many concurrency frameworks, including Java's ExecutorService.

Some concrete implementations of BlockingQueue are ArrayBlockingQueue and LinkedBlockingQueue.
These queues are extensively used for inter thread communication; one of the complex aspects of a blocking queue is the waiting strategy.

Some waiting strategies
 - Sleep based - This is the most naive way; sometimes it can be the best fit.
 - Spin based - This is best when you know the cost of a context switch is more than spinning; a good fit if you have a free core available.
 - Hybrid (spin & sleep) - Hope to get the best of both worlds!
 - Lock based - The classic wait/notify type; JDK 5 added a simpler API to achieve this using Condition. Locks & Condition hide all the complexity required for inter thread communication. It is used in ArrayBlockingQueue, LinkedBlockingQueue and many more places in the JDK.

ArrayBlockingQueue Vs LinkedBlockingQueue
Both types of queue use a lock based waiting strategy, but ArrayBlockingQueue is much faster compared to LinkedBlockingQueue because it uses an array as the underlying data structure:

 - It does all the memory allocation up front, so a little less work for the GC.
 - Linear memory access as compared to a linked list. I shared my findings on cache friendly linked lists in the linked-list-revisited-for-in-memory blog.

Throughput of Queues
Let's look at throughput numbers for some queues. In this test I add 50 million messages to the queue, with 1 producer & 1 consumer.

I am using 3 types of queue in this test
 - LinkedBlockingQueue
 - ArrayBlockingQueue
- ConcurrentLinkedQueue - This is a lock free queue, and it is interesting to see its numbers. Since this is not a blocking queue, I use a spin based waiting strategy on the consumer side.
Another reason to include CLQ is that later I will use different waiting strategies with this queue to make it blocking on the consumer side.

Y Axis - Throughput in 100K/Second
X- Axis - Type of Queue

The numbers look as expected; ConcurrentLinkedQueue is the best.

Now I will add some proper wait strategies to ConcurrentLinkedQueue and measure performance.
 - Lock based - This one is similar to what ArrayBlockingQueue uses.
 - Park based - This is a custom wait strategy inspired by the Conditions of Lock, but it is a bit different:

  • You don't have to acquire a lock to use it, so that overhead is removed.
  • You have to specify the number of waiters; it is similar to the concurrency level of ConcurrentHashMap. A slot is created for each waiter and used when the thread is parked. If no slot is available, the thread is instead parked for some fixed time.

Throughput  - Comparison

Y Axis - Throughput in 100K/Second
X- Axis - Type of Queue

Some description about labels
Con Spin - ConcurrentLinkedQueue using spin
Con Lock - ConcurrentLinkedQueue using lock (i.e. using Condition)
Con Park - ConcurrentLinkedQueue using explicit thread parking

ConcurrentLinkedQueue with the lock based strategy suffers; its performance is similar to ArrayBlockingQueue. So it is definitely not a good idea to use locks here, because all the benefit of the lock free queue is gone once locks are introduced.

The one using the custom park strategy described above outperforms the rest; it is around 150% faster than ArrayBlockingQueue & around 13% faster than the spin based concurrent queue.

Executors Using ConcurrentLinkedQueue
The park based strategy, which is like Conditions in the JDK but without the overhead of a lock, seems the better choice, but before we conclude anything, let's look at another test result.

I performed a test on Executors using different types of queue (ArrayBlockingQueue, ConcurrentBlocking with park).
I also included the WorkStealingPool that is included in JDK 8.

In this test, 1 million messages are added by each producer to the ExecutorService, and the number of producers & threads in the executor is increased gradually.

The executor using ConcurrentLinkedQueue with the park wait strategy seems best, and in some cases it is better than ForkJoinPool, especially as the number of producers is increased.

Waiting strategies are complex; different use cases might need different strategies to get the best performance. This blog shows one strategy that seems better than the lock based strategy we have in the JDK.

Code is available @ github

Sunday, 22 September 2013

Linked List Revisited for In Memory computing

A linked list is one of the first data structures you learn when you start reading about data structures.
The linked list has its own advantages over other data structures like the array; just to recap a few of them:

Some benefit of Linked List
 - Insert cost is low: just update the address in a node.
 - Memory allocation cost is very low because only one node has to be allocated and attached to the last node.
 - It is a very good data structure when you don't know how much data you will be storing; it grows on a need basis, with no wasted space.
 - Very low overhead when an element is removed.

Things that are not good about Linked List
 - Access by index is bad and not recommended.

 - It creates a lot of work for the garbage collector; it can result in frequent minor GCs, due to which young objects are promoted early to the old generation.

 - An extra node object is required for each element, so the total memory required is (Node + Value) * number of objects. This can become a problem when there are a lot of objects.

 - Random memory access because of the way nodes are linked. This is the major drawback: a random memory walk is slow compared to a linear walk. The CPU does a lot of hard work to optimize memory access, but linked list or tree based data structures can't use those optimizations due to the random memory walk. On NUMA based machines it becomes worse.

Do we need Linked List now ?
That is a really tough question to answer, but the way the linked list performs, it looks like it has no place in the data structure library.

With the current in-memory trend, where applications keep all their data in memory, data structures like the linked list are not a good fit.

Let's tweak the linked list for today's requirements.
Today we need a linked list that has the advantages of both the linked list & the array; that way we can get the best of both worlds.

For testing I created a MultiValueLinkedList that has features of both an array & a linked list: each node contains a small array list, and all these nodes are linked.
So you could say it is array lists linked via a linked list.
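The actual class is on github; the core idea (an "unrolled" linked list) can be sketched like this, where the chunk size of 1024 is an arbitrary choice of mine:

```java
import java.util.function.Consumer;

// Unrolled linked list: each node holds a small array of elements,
// so allocation happens once per CHUNK elements and iteration walks
// memory mostly linearly. Name and chunk size are illustrative.
public class MultiValueLinkedList<T> {
    private static final int CHUNK = 1024;

    private static final class Node {
        final Object[] values = new Object[CHUNK];
        int count;
        Node next;
    }

    private final Node head = new Node();
    private Node tail = head;
    private int size;

    public void add(T value) {
        if (tail.count == CHUNK) {        // current chunk full: link a new node
            Node node = new Node();
            tail.next = node;
            tail = node;
        }
        tail.values[tail.count++] = value;
        size++;
    }

    public int size() {
        return size;
    }

    @SuppressWarnings("unchecked")
    public void forEach(Consumer<T> action) {
        for (Node n = head; n != null; n = n.next) {
            for (int i = 0; i < n.count; i++) {
                action.accept((T) n.values[i]);
            }
        }
    }
}
```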

For the performance test I add some millions of items to LinkedList/MultiValueLinkedList/ArrayList and iterate over them. I start with 10 million elements and increase by 5 million.

Add Performance 
Let's look at some numbers for the add operation on the different types of list.

X Axis - Time taken for add in Ms
Y Axis - No of Element added in Millions

MultiValueLinkedList is the clear winner in this test; it is around 22X faster than LinkedList.

LinkedList performance is so bad that it screws up the graph scale and it is difficult to make out the difference between MultiValueLinkedList & ArrayList, so here is another graph for just MultiValueLinkedList & ArrayList.

ArrayList Vs MultiValueLinkedList

This graph gives a better understanding of the performance of ArrayList & MultiValueLinkedList: the new linked list is 2.7X faster than ArrayList.

Iterate Performance
Let's look at iterate performance before I try to explain the reason for the performance boost.

X Axis - Time taken for iterate/access in Ms
Y Axis - No of Element in Millions

As I said earlier, the new linked list is a hybrid; it takes the best from LinkedList & ArrayList, and this graph confirms that: it sits somewhere between LinkedList & ArrayList.
MultiValueLinkedList is around 1X faster (i.e. roughly twice as fast) than LinkedList.

Garbage Collections
This test is all about allocation, so it would not be complete without a GC activity graph. I wanted to add more GC observations, but that would make this blog too big, so I will put up just one Visual GC graph per list.

Visual GC Graph for Linked List

Visual GC Graph for MultiValueLinkedList

I don't have to explain much about this GC behavior: for LinkedList the minor GC activity is so high that objects are promoted to the old generation, although it is not required.
GC behaves as expected for MultiValueLinkedList.

 - Add performance is best for MultiValueLinkedList, due to fewer allocations & almost no need to copy the array, which ArrayList must do whenever it grows.

 - Garbage produced by MultiValueLinkedList is very low. Memory usage of MultiValueLinkedList is the same as ArrayList's, while for LinkedList it is double; that explains why we see objects promoted to the old generation.

 - Iteration of MultiValueLinkedList is slow compared to ArrayList because the memory access pattern is page by page; it is possible to improve it further by adjusting the number of elements per node.

Code is available @ github

Saturday, 14 September 2013

Concurrent Counter With No False Sharing

This blog is a continuation of the Scalable Counter post.

One of the readers shared results from his system. He ran the test on a 16 core Intel Xeon processor, and the total time taken by each type of counter was almost the same: although the atomic counter had CAS failures & the other types didn't have any, it made no difference in execution time.
A very strange result, which needed further investigation.

Another reader pointed out that it could be due to false sharing, so it was worth taking that into account, and I created another class that handles false sharing.

Time taken by different counters
 Y Axis - Time taken to increment 1 Million times

X Axis - Number of threads

PaddedAtomicCounter is a new type of counter that I added to the test & it outperforms all the other counters.
It uses cache line padding to avoid false sharing.
A cache line is 64 bytes on most of today's processors. PaddedAtomicCounter is an int based counter, so it adds 16 slots per counter (16 ints = one 64 byte cache line); by using this technique we avoid cache pollution. As a result we see a 16X gain compared to AtomicCounter: without padding the gain was 5X, and with cache line padding it jumps to 16X.
With cache line padding you need some extra space, so it is a trade off of memory for speed; you can choose what you want!
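A sketch of the padding idea using an AtomicIntegerArray (the real PaddedAtomicCounter is on github; the stripe count and slot layout here are illustrative): with 16 ints (64 bytes) between live slots, two threads incrementing different stripes never write to the same cache line.

```java
import java.util.concurrent.atomic.AtomicIntegerArray;

// One counter slot per thread stripe, padded so each live slot sits
// on its own 64-byte cache line (16 ints per line).
public class PaddedAtomicCounter {
    private static final int STRIPES = 16;
    private static final int INTS_PER_CACHE_LINE = 16; // 64 bytes / 4

    private final AtomicIntegerArray counters =
            new AtomicIntegerArray(STRIPES * INTS_PER_CACHE_LINE);

    public void increment() {
        // Spread threads over stripes; each stripe's slot is 64 bytes
        // from its neighbours, so writes never share a cache line.
        int stripe = (int) (Thread.currentThread().getId() % STRIPES);
        counters.incrementAndGet(stripe * INTS_PER_CACHE_LINE);
    }

    public long get() {
        long sum = 0;
        for (int i = 0; i < STRIPES; i++) {
            sum += counters.get(i * INTS_PER_CACHE_LINE);
        }
        return sum;
    }
}
```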

CAS failure rate
Let's look at the CAS failures for the different counters & what they mean for performance.

Y Axis - CAS Failure in 100Ks

X Axis - Number of threads

PaddedAtomicCounter has some CAS failures compared to the other counters, but they make no difference in the execution time of the counter.
CAS failures are not the only factor that determines execution time; false sharing makes a significant contribution, so this gives a good explanation of the behavior seen on the Xeon processor.

To get better performance you have to take care of few things

 - Contention - There are many techniques to avoid it; this blog shows one of them.

 - False sharing - You have to avoid false sharing to get the best out of the processor, and padding is required for that. One JDK class that uses padding is ThreadLocalRandom; now we also have the @Contended annotation in Java to achieve the same thing, and it is used in ForkJoinPool.

Code is available @ github


Tuesday, 10 September 2013

Scalable Counters For Multi Core

Counters are required everywhere: to find key KPIs of an application, the load on an application, the total number of requests served, KPIs for finding throughput & many more.

With all these requirements, the complexity of concurrency is added on top & that makes this problem interesting.

How to implement concurrent counter
 - Synchronized - This was the only option before JDK 1.5. Since we are now waiting for the JDK 8 release, this is definitely not the option to pick.

 - Lock based - You should never attempt this for a counter; it will perform very badly.

 - Wait free - Java doesn't have support for fetch-and-add, so it is a bit difficult to implement.

 - Lock free - With very good support for compare-and-swap, this looks like a good option to use.

How does a compare-and-swap based counter perform?

I used AtomicInteger for this test. The counter is incremented 1 million times by each thread &, to increase contention, the number of threads is increased gradually.

Test Machine Details
OS : Windows 8
CPU : Intel i7-3632QM, 4 cores / 8 logical
RAM : 8 GB

Y Axis - Time taken to increment 1 Million times

X Axis - Number of threads

As the number of threads increases, the time taken to increment the counter increases, and it is due to contention.
For a CAS based counter, it is CAS failures that cause the slowdown.

Is this the best performance we can get? No, there are definitely better solutions for implementing a concurrent counter; let's have a look at them.

Alternate Concurrent Counter
Let's look at some counter solutions that handle contention in a better way.

 - Core based counter - Maintain a counter for each logical core; that way you have less contention. The only issue with this type of counter is that if the number of threads is more than the number of logical cores, you start noticing contention again.
 - Thread based counter - Maintain counters for the total number of threads that will use the system. This works well when the number of threads is more than the number of logical cores.
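A sketch of the core based variant (the class name is mine, and slot selection by thread id is the simple scheme mentioned later in the post): each thread increments its own slot, and get() sums all slots.

```java
import java.util.concurrent.atomic.AtomicLongArray;

// Core-based counter: one slot per logical core, threads hash to a
// slot by id, and get() sums all slots.
public class CoreBasedCounter {
    private final AtomicLongArray slots;

    public CoreBasedCounter() {
        this(Runtime.getRuntime().availableProcessors());
    }

    public CoreBasedCounter(int slotCount) {
        slots = new AtomicLongArray(slotCount);
    }

    public void increment() {
        // Threads mostly hit different slots, so CAS failures are rare.
        int index = (int) (Thread.currentThread().getId() % slots.length());
        slots.incrementAndGet(index);
    }

    // Aggregate across slots; the sum is momentarily stale under
    // concurrent writes, which is fine for a statistics counter.
    public long get() {
        long sum = 0;
        for (int i = 0; i < slots.length(); i++) {
            sum += slots.get(i);
        }
        return sum;
    }
}
```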

Lets test it

Time taken by different types of counter

Y Axis - Time taken to increment 1 Million times

X Axis - Number of threads

The concurrent counter performs much better than the atomic counter; for 16 threads it is around 5X better, which is a huge difference!

CAS Failure Rate

Y Axis - CAS Failure in 100Ks

X Axis - Number of threads

Due to contention, the atomic counter sees a lot of CAS failures, and they go up exponentially as I add more threads, while the other counters perform pretty well.

Multi core machines are becoming easily available & we have to change the way we handle concurrency; the traditional way of doing concurrency is not going to scale at a time when 24 or 48 core servers are very common.

 - To reduce contention you have to use multiple counters and then aggregate them later.

 - A core based counter works well if the number of threads is less than or equal to the number of cores.

 - A thread based counter is good when the number of threads is much more than the available cores.

 - The key to reducing contention is identifying which counter a thread will write to. I used a simple approach based on thread id, but much better approaches are available; look at ThreadLocalRandom of JDK 8 for some ideas.

 - The thread based approach is used in LongAdder of JDK 8, which creates many slots to reduce contention.

Code for all the counters used in this test is available @ Github.