Wednesday, 25 October 2017

Data structure for big data

Today crunching Terabyte of data is very common and it brings up interesting challenges.

Memory is getting cheaper and it is enabling application to keep all the data in memory and process it but some data set are very huge and does not fit in single machine, incase it fits in memory but latency of processing will be not a acceptable.

Lets take a example to understand this

You are building web analytics product that collects all the clicks/page load etc events, each event object will have some of these attributes.

 Volume generated by such type of tracking is huge and to ask any interesting question on such data set is resource intensive.   

Lets look at some of the resource intensive questions 

- How many distinct ip address are present
- How many distinct visitor Id
- Top 10 Pages
- Top X Visitor
- How many times Top X pages are requested
- Any page requested by blacklisted IP or user 

Simple way to solve distinct query is to build set and count the items in set
Building set is going to take memory proportional to number of distinct element and set has some load factor to make sure read/write operation are constant time. So memory wise it is super expensive.

Key insight to solve the problem is how accurate distinct count should be , it is OK to have some error(i.e 5 to 10%) and if answer is yes then some magic can done.

Since this problem is about just the count , so we don't have to maintain real value and that will give huge saving, algorithm needs some way to track  whether current value is seen or not.

at high level algorithm is 
 - Compute the hash
 - Find the index in bit vector
- Mark the vector if is not set and increment the distinct counter.

It is very simple algorithm and does the job, it is called Linear Counting.
This works well but it has some limitation that you need to workout size which controls accuracy to get best result it should be close to number of distinct values you expect to see.

There is another class of algorithm called LogLog/Hyper LogLog which more powerful than Linear Counting.

Hash value is represented as binary value and then it is split in sub-string
One part is used to extract which bucket value should go
Next part is used to calculating value and it is done by counting no of leading zeros.

By using this technique each entry is put in different buckets for eg values starting with 01- Goes in X, values starting 001 goes in Y bucket.

To compute the final value all the buckets values are taken and it is averaged by applying some factor.

By using this technique billion elements can be counted by using very less memory.

HyperLogLog paper contains all the details of algorithm.
Linear Counting or Hyper Log can be used to answer below questions
- How many distinct ip address are present
- How many distinct visitor Id
- Distinct visitor by region, time etc
- Trending words like google trends
- Near similar document matching to remove duplicate document, it is used by search engine.

This data structure can be used for quick anomaly/fraud detection.
This also can be used for Nested_loop_join , spark is using it in ML lib for MatrixFactorizationModel and few more places to count keys in RDD.    

One of the nice properties of these data structure is merge operation.
for e.g. one hyper log can be maintained for unique visitor for each day and they can merged to get value for any time range.

Lets look at some memory usage numbers, it is computed using stream-lib
In this benchmark i used all the words from HarryPotterseries book.

Actual Count 11983

Type Memory ( KB) Count Error %
Set 431 11983 0
LinearCounting 31 12167 1.5
HyperLog 14 12253 2.2

In less than 1% of memory with 2 % of error this data structure is able to do cool thing !

Another interesting data structure for membership query is bloom filter, i wrote about in mr-cool-bloom-filter blog post.

Code used for bench-marking is available @

Saturday, 16 September 2017

Unix design patterns and philosophy

Design patterns got very famous after release of Design-Patterns-Elements-Reusable-Object-Oriented book and it became one of the advance topic of software engineering.

It become one of the must book to have it on desk, it also gave common vocabulary to talk about design. Design pattern idea was extended to capture solution to recurring problem. 

Linda Rising took same idea and wrote book on Fearless Change Patterns Introducing Ideas , she wrote part2 of book More Fearless Change and i am sure lot more books will come in future which will present patterns idea in different shapes & forms.

In this blog i will share design patterns used in UNIX.
All the patterns are 30+ year old but are so much relevant in microservices time.

Lets start with famous quote from  "C. A. R. Hoare"

"There are two ways of constructing a software design. One is to make it so simple that there are obviously no deficiencies; the other is to make it so complicated that there are no obvious deficiencies. The first method is far more difficult."

The Emperor’s Old Clothes, CACM February 1981
—C. A. R. Hoare

Bottom up design

First design philosophy is unix is build using bottom up approach, it is build from lego blocks, built small pieces and then compose them to create something more useful. Functional programming is based on this principal !

Make each program do one thing well 

    This is such a simple rule but not used enough, developer loves complexity and this rule is violated everyday and results in software that does too many things but many features are not implemented properly.

Some of the example from UNIX of one thing well are grep, diff , patch , Yacc.
These programs are bugfree and so much that functionality is taken for graunted.

Design programs to be connected with other programs(Composition)

This is the rule that invented PIPE (i.e | ) , if our program are build using this rule then lot of time/money spent on integration can be saved.
In Unix world binary input/output is avoided and this makes integration so easy.
Unix program separate interface from logic, interface is "Text" stream and produces "Text" stream.

Each program is independent and they can be composed together because they follow simple rule of "text" interface.
As a developer we like to do premature optimization and start with binary input/output, use binary format only when bottleneck is proved.


Build software that can be tried early may be in in few days or 1 week. 

This philosophy is also applicable for complex software like Operating System/ Device Driver etc, today we have to take help of Agile to build software that can be tried in 2 or 3 weeks.
Unix guys were doing Agile in 1970.

Avoid Fancy algorithm

Fany algorithm are complex to implement and have bugs and gives dividend only when N is above some threshold, by using simple algo and data structure lot of complex problem can be solved.

Data dominates 
Spend more time on building data structure that will organize data properly.

Write simple parts connected by clean interface 
This is the most difficult on to get right, we have seen different programming paradigm starting form assembler , structural , object oriented , functional etc but all of them have failed, as soon as program turns from POC to real it can't fit in head of human brain.

Clarity over Cleverness 

I am sure every developer will have some war story to tell about debugging Clever program.

Rule of Transparency
This is one the things that is thought  on last day of development in-spite of knowing that this rule is to make inspection and debugging of program easier.

Rule of Repair

When you fail, fail noisily and as soon as possible.

Single Point of Truth(SPOT) rule

This rules says that every piece of knowledge must have single, unambiguous representation in system. Repetition leads to inconsistency so use Constants, tables, metadata, code generator . Complicity is cost , don't pay it twice.  

All the philosophy boils down to

Monday, 24 July 2017

Java Intrinsic magic

I got question on atomicinteger-java-7-vs-java-8 blog about implementation of unsafe.getAndAddInt function

It looks like JDK8 is still using CAS version.
If you just look at the code in IDE then you are correct , but JVM is intelligent hotspot compiler.

Code goes through various optimization before it is executed, optimization list is huge and it will required blog series to just touch on it.

In this blog i will discuss about Intrinsic function

As per WikiPedia 

In compiler theory, an intrinsic function is a function available for use in a given programming language whose implementation is handled specially by the compiler. Typically, it substitutes a sequence of automatically generated instructions for the original function call, similar to an inline function. Unlike an inline function though, the compiler has an intimate knowledge of the intrinsic function and can therefore better integrate it and optimize it for the situation. This is also called builtin function in many languages.
Compilers that implement intrinsic functions generally enable them only when the user has requested optimization, falling back to a default implementation provided by the language runtime environment otherwise.

So in other words, some function are just like native function, it is different from native function written by us.

Compiler makes decision to use native/special function if is available other wise if will fallback to default implementation.

Unsafe.getAndAddInt is special function that gets advantage of optimization, but normal code is
still required for case when optimization is not requested or not available

To verify this we have to look at Assembly code generated.

how-to-print-dissasembly-from-jit-code article has all the details you need for this.

You can also download dll/lib for windows/linux from github

For below code snippet, optimization kicks in

Assembly code for above code looks like below ( look at line number 7, it is xadd)

Java maintains list of all intrinsic functions in code @ vmSymbols.hpp

Hotspot compiler is full of magic:-)

Tuesday, 27 June 2017

Programmer Oath

Uncle Bob is very famous for his ideas on Clean Code. He is very hardliner on Test driven development.

I am big follower of him and have read few books Clean Code , Clean Coder  written by him, i have become better programmer after reading his books.

I read Clean Coder few years back and felt that this book is must read for every professional software developer.

Uncle bob recently gave a talk The Scribe's Oath , that summaries the whole book in 1 hour.

In this talk he sets nice context before getting to real stuff. He talks about couple of Oaths for programmer.

Not writing harmful code 
He talks about harm software developer does by releasing defect , makes code harder for other programmer to understand. Software should be easy to change and anything that makes hard to change is harmful code.

Code that i will produce will be my best work
This is something that you are proud of.

Each release with quick , sure & repeatable proof that every element of code works as it is supposed to.
This is all the TDD, BDD, CI etc thing comes in. It is inappropriate to accept some level of defect.

Frequent small release and not impede progress
Holding code to yourself and not let anybody seeing it, holding on to big commit are few things that will impede progress.  Small releases are better.

Fearlessly and relentlessly improve the code at every opportunity and never make it worse.
Almost every software developer is slowed down by bad code , but we still continue to write bad code for some reason. We allow bad code to rot in absence of automated Unit test. This is where The Boy Scout Rule is useful . Automated test are really useful to keep this promise.

Keep my and my team productivity high.
Don't hold on the big commit, don't create test that takes hours to run.

Continuously ensure that others cover for me and i cover for them.
This one is interesting, he talks about working like sports team and handling situation when one of the team member is injured or down.
Pair Programming can be good tool to achieve this.

Produce estimate that are honest both in magnitude and precision, no promise without certainty
This is tricky one, most honest estimate is "I DON'T KNOW" but this does not work in real world and we fall in estimate trap and make false promise. He recommend to draw curve around uncertainty.
You manager will come to you and say that we need this in next 2 days , can you try ?
Your answer to this should be i am already trying and not holding back any extra capacity. You should say no, when it is not possible .

Never stop learning and improving my craft.
You have to keep on leaning new language, framework, library otherwise you will fall behind and have to become manager :-)
Links to some of Uncle bob talks that i found good

Effective Estimation (or: How not to Lie)
The Principles of Clean Architecture
What is OO really
SOLID Principles of Object Oriented and Agile Design

Thursday, 25 August 2016

Lazy evaluation

Recently i was writing log4j appender and wanted to use logger in it to log some diagnostic details during custom appender creation, but log4j initialization completes only after appender instance are created, so message logged during this phase are ignored.

I felt the need for lazy initialization in custom appender and started to look at options.

In this blog i will share things that i tried.

One of the thing that came to my mind was Singleton approach but now it is known fact that singleton causes problem with testing and make it impossible to extend it, so approach of mixing concurrency & object construction is not that good.

Incase if singleton is required then it is better to use Dependency Injection framework rather than spoiling your application code.

Lets get back to lazy initialization/eval.

Some programming language like scala/swift etc has support for lazy, so no custom code is required to do this but in java space we still have to write thread safe code to get it right.

Lets look at some options we have in java and what type of performance we get.

- Brute force using Synchronized
This is the most simple and inefficient one, scala is using this approach. Scala one is available @

- Double lock
This is little complex to write and gives good performance.

- Using Future task
This approach is simple to write and gives good performance.

Double lock approach gives the best performance and brute force one is worst. I did quick bench mark for 1 Million calls using different number of thread.

Single lock performance is very bad, lets have look at the number by removing single lock to see how Double Lock & Future Task performed.

These benchmark are done very quickly but detailed benchmark numbers should be close.

Code for this blog post is available @ github

Sunday, 19 June 2016

Java Arrays Sort decoded

Sorting is one the first algorithm that we learn in computer science. Sorting is such an interesting area that it has around 20+ algorithm and it is always difficult to decided which one is best.

Sorting algorithm efficiency is measured in terms of time taken & space required.

Some time bubble sort is best because it has no space requirement and for device where space is constraint or random access of element is not possible then it can be good fit .

These days we tend to use library sort functions, most of language library sort function is adaptive and it uses best algorithm depending on size of data.

In the blog i will share how those decision are made in java Arrays.sort function.

Decision are based on Data Type and Size

 - Byte 
For byte java API decides between Counting Sort or Insertion Sort.

If size of input array is less than 29 then insertion sort is used, visualization of insertion sort
For large arrays Counting sort is used and it is based on fact that range of byte is -128 to 128 and it used as advantage to do quick sort.

Counting sort has little memory requirement and insertion is in place, so over all not much allocation is done and it will keep garbage collector happy when byte array is sorted.

- Char
For char decision is between Counting Sort & Dual Pivot QuickSort

If size of input is greater than 3.2K then counting sort it used and it allocates array of 65K size to implement sorting.

For smaller array Quick Sort variant using Dual pivot is used , visualization of quick sort.

QuickSort used is also in-place , so memory wise not much load on Garbage Collector.

- Integer/Long

For integer/long things gets interesting as Merge Sort makes entry.

For input less than 256 , two options are available
 - If input is less than 47 then Insertion Sort and for other case Dual Pivot QuickSort is used.

For large input array there are some good edge case checks
 - If arrays is already sorted either Ascending or Descending then it is single loop to check that.

 - If element of array are same then Quick Sort is used as it works best in such case.

 - Or if element are really jumbled like for e.g like each even element is greater then odd element then it will use Quick Sort.

and at last all this checks fails then Merge Sort is used and new array of same size is allocated and sorting is performed. Quick refresher of Merge Sort

Important thing to note about Integer sort is that if Array is already sorted then no memory is allocated and if QuickSort kicks in memory allocation is under check.

- Float/Double

Float has special optimization for NAN, all the NAN are moved to end of array and they are skipped from sorting.

Once NAN values are handled then sorting goes through same checks as INTEGER data type.

- Sorting on Objects

Collection sorting has little different rules, for collections it is only between Merge Sort & Timsort.
By default Timsort is used which is mix of merge & insertion sort .

Merge sort is more over decommissioned and it is only used when "java.util.Arrays.useLegacyMergeSort" flag is switched on.

In JDK 8 parallel sort options are also added which are based on input size of array, for arrays with size greater than 8K then parallel version of sort is used.

Friday, 8 April 2016

Binary Search

This blog is part of series that i wanted to start Preparation of Coding Interview

Binary search is one of the first search algorithm that we learn, it is simple to implement and very powerful, gives 0(log n) guarantee.

Quick recap of binary search


It is simple to implement but nearly all binary search were broken due to one bug for long time, you can read about it @ Nearly All Binary Searches are Broken  

Enough of history lets get back to topic and discus some of the interview question on this 

 - Implement Binary Search

This question is still asked so you should practice it because all the logic around index manipulation is little tricky.

Look the line that is finding MID, it is the line that had the bug

 -Find rotation point

This is small variation of binary search, you have to find rotation point in ordered array, for e.g 

Elements are in increasing order then it drops to min value and then again elements are in increasing order.

How do you define rotation point ?
for any element if both left & right element are greater then current element is at rotation point.

In above case 5th element is at rotation point because both left(4th) and right(6th) are greater than 5th

Another example 
char[] values = {'i', 'j', 'k', 'v', 'a', 'b', 'c', 'd'}

5th element(i.e 'a') is at rotation point

Simple Binary Search will not work on this because 2 sorted array are joined in this.
If you try to solve this by brute force then have to go through each element till you find element that is less than previous.

Brute force solution will look something like below

Since this array is sorted , so it is possible to write hybrid binary search that can take advantage of sorted data.

Now real question is how to decided when to go left or right ?

If we just apply normal binary search logic then mid point is 'v'(i.e 4), now which element should be compared to decided left/right ?

Ah it is the first one ('i') , if mid is greater than first then go to right or if it is less than first then go to left.

Lets put some picture to understand it.

 Step 1 - v > i , so we go to right
 Step 2 - b < i, so we go to left
 Step 3 - mid value (a) is at rotation point and program ends.

Code will look something like below

Hopefully visualization was ok enough to explain how to approach this problem.
My visualization skills are very bad & i have to work on it.