Sunday 4 February 2024

Demystifying Vector Databases: The Magic of Meaningful Search

What is vector database ? 


The digital world is awash in unstructured data. text documents, social media posts, images, videos, audio recordings, and more. While traditional database excel at storing and retrieving neatly organised data, they struggle with this messy, ever-growing sea of information. Enter vector databases, a new breed of database designed to unlock the hidden meaning within unstructured data.

While Large Language Models (LLMs) have brought vector databases to the forefront, their applications extend far beyond this exciting field. Recommendation systems use vector databases to suggest products you might like based on your past purchases and browsing history, even if you haven't explicitly searched for those items. Fraud detection systems leverage them to identify suspicious patterns in financial transactions, helping catch anomalies that might slip through traditional filters.


But how do these databases work their magic? It all starts with a clever trick: representing data as multi-dimensional vectors, essentially numerical lists. Imagine every data point as a location on a map. Nearby points on the map represent similar data, regardless of the original format (text, image, etc.). This is achieved through techniques like word embeddings, where words with similar meanings are mapped to close points in the vector space.

Traditional keyword-based searches often miss the mark. Imagine searching for "small, fleshy, and seedless" fruits. No exact match exists, leaving you frustrated. But a vector database understands the underlying meaning of your query.

It finds data points closest to the "small, fleshy, and seedless" vector, leading you to grapes or kiwis, even though those words weren't explicitly used. This semantic search capability unlocks a new level of data exploration and analysis.


Search - Legacy vs Semantic



 

How vectors are created ?

But how do these magical numbers come to life? Enter embeddings, numerical representations of data points created by deep learning models. Imagine feeding a vast collection of text documents to a sophisticated neural network. It analyses the relationships between words, their context, and their usage, eventually generating unique vector representations, or embeddings, for each word. These embeddings capture not just the literal meaning of the word, but also its nuances and semantic connections




Generally, the last layer of deep learning models focuses on specific tasks like prediction or classification. But the true treasure trove of knowledge lies in the second-to-last layer, often called the bottleneck or hidden layer.

This layer holds a condensed representation of the input data, capturing the essential features and relationships learned during training. By strategically removing the last layer and accessing the information in this penultimate layer, we can extract vector embeddings that encapsulate the model's understanding of the data.

Higher dimensionality captures more information but requires more storage and computation, while lower dimensionality is space-efficient but might miss some nuances.

The key is to find the right balance between the dimensionality (size) of the embeddings and the desired level of detail.

Forget training your own model! The world after Chat GPT offers a wealth of ready-made embedding models






How to get embeddings?





Use case vectors solve?

Get ready to explore the diverse problems solvable with vector embeddings! These powerful representations go beyond text, unlocking:

1. Semantic Search: Dive deeper than keywords. Find images, videos, or audio similar to your intent, not just literal phrases. Imagine searching for "peaceful nature scene" and discovering breathtaking waterfalls instead of generic landscapes.

2. Data Similarity Search: Uncover hidden connections across non-text data. Quickly identify similar products, faces, or even medical scans, regardless of format.

3. Personalised Recommendations: Get suggestions that truly understand you. Vector embeddings power recommendation systems that learn your preferences and suggest items you'll genuinely love, not just similar purchases

4. Retrieval-Augmented Generation (RAG): Bridge the gap between information retrieval and generation. Leverage vector embeddings to create summaries, translate languages, or even write different creative text formats based on specific requests. This is number #1 application of LLM powered apps.

5. Fraud and Anomaly Detection: Spot suspicious activity faster. Vector embeddings help identify unusual patterns in transactions, financial data, or even network traffic, leading to improved security and fraud prevention.

6. Search Result Ranking: Get the most relevant results first. Embeddings power search engines to understand your intent and rank results based on meaning, not just keyword matches.

7. Efficient Clustering: Group similar data points effortlessly. Vector embeddings enable efficient clustering of large datasets, revealing hidden patterns and facilitating further analysis.

And that's just the beginning! The potential of vector embeddings continues to expand, promising exciting solutions in areas like drug discovery, social network analysis, and more.


 

How Vector database uses vector?

Let's explore their first superpower: semantic similarity. Unlike traditional keyword searches, vector databases understand meaning.

You can input a vector, and the database returns vectors representing the most similar meaning content, not just exact matches.

This is classic example from popular paper written in 2013 - Efficient Estimation of Word Representations in Vector Space



Several algorithms can be used for calculating vector difference, each with its advantages and limitations depending on the specific application and data characteristics. Here are some common ones:

Jaccard similarity
This compares the proportion of shared elements between two binary vectors (containing only 0s and 1s), often used for comparing sets or sparse data.



Hamming distance 

Between two strings or vectors of equal length is the number of positions at which the corresponding symbols are different. In other words, it measures the minimum number of substitutions required to change one string into the other, or equivalently, the minimum number of errors that could have transformed one string into the other



Euclidean distance: This is the most straightforward and intuitive method, calculating the straight-line distance between two points in multidimensional space. It's computationally efficient but sensitive to data scaling and dimensionality.

Manhattan distance: This measures the distance by summing the absolute differences between corresponding elements of the vectors. It's less sensitive to outliers than Euclidean distance but not as intuitive for representing geometric similarity.










Inner Product : This method is a mathematical operation that measures the degree of similarity or alignment between two vectors. It tells you how "close" two vectors are in the multidimensional space they inhabit





Cosine similarity: This method measures the angle between two vectors, reflecting their directional similarity. It's independent of magnitude and useful when interpreting vectors as directions rather than exact positions.



Conclusion

This post delves into the fascinating world of vector databases, equipping you with a solid understanding of their core concepts, vector creation methods, and similarity search algorithms.

In the next section, we'll dive into the unique storage and retrieval mechanisms employed by vector databases. Unlike traditional databases that rely on B-trees or hash indexes, vector databases utilize innovative approaches specifically designed for efficient vector searches. Get ready to explore a whole new level of data exploration!







Monday 14 August 2023

Multi Indexing

Imagine you are the owner of a grocery store, which stocks a wide variety of products such as bakery items, dairy products, eggs, grains, vegetables, oils, frozen foods, and more.




Now, you want to create a hybrid index that allows for fast lookups based on several attributes of the products, including:

  • Product name or ID
  • Product expiry date
  • Discount percentage
  • Fast-moving/selling items

In essence, this index will enable both simple lookups and more advanced functionalities like ranking products.

One crucial distinction to note between these types of lookups is that:

  1. A search by product name or ID will yield unique items, meaning each product is listed only once.
  2. A search by expiry date, discounted products, fast-moving items, etc., will return results ordered by their respective values, and the values are not necessarily unique.

With this hybrid index in place, you'll be able to efficiently manage your inventory and cater to customer demands more effectively. Whether it's quickly finding a specific product or identifying the most popular items, this system will streamline your grocery store operations and enhance the overall shopping experience.





Which data structures ?


Databases can effectively manage these types of access patterns through the implementation of indexes on the query columns and the strategic use of the 'ORDER BY' clause. However, it's important to note that this approach comes with its own set of trade-offs, which we'll explore in a future post.


For now, let's direct our attention towards determining the most suitable data structure for the task at hand.

When faced with the challenge of efficiently handling two distinct access patterns - key-value lookup and ordered lookup by some attribute - traditional data structures like Maps, Binary Search Trees, or Priority Queues may not individually fulfil all the requirements. As a solution, we can create a new data structure that combines the strengths of these structures into one, calling it Treep (Tree with priority).

Treep will have the following characteristics:

  1. Key-Value Lookup: Like Maps, Treep will allow fast key-value lookups, enabling quick retrieval of data using unique identifiers.

  2. Ordered Lookup by Attribute: Similar to Binary Search Trees and Priority Queues, Treep will maintain the data in a sorted order based on a selected attribute. This will enable efficient access to elements in a specific order (e.g., ascending or descending) with respect to that attribute.


The interface of the Treep data structure may look like below

public interface Treep<K, V> {
void add(V value);

Set<String> orderBy();

V get(K key);

V top(String attributeName);

V takeTop(String attributeName);

V delete(K key);

Iterator<K> keys();

int size();
}

get - This method efficiently handles the simple key-based access pattern. By providing a key as input, you can swiftly retrieve the associated value from the Treep. This operation is ideal for quick lookups of specific elements using their unique identifiers.

top/taketop - This method caters to the access pattern of retrieving elements ordered by a specific attribute.



Code snippet using Treep

Treep<String, Product> stores = new MapPriorityQueue<>(Product::name, Map.of(
"discount", Comparator.comparing(Product::discount).reversed(),
"price", Comparator.comparing(Product::price).reversed()
));


Arrays.asList(
Product.of("AXION Yellow", 2.12f, .10f),
Product.of("Meji Fresh Milk 2L", 6.9f, 0.0f),
Product.of("red Chilli 100 G", 1.14f, .05f),
Product.of("Fresh Cucumber", 1.37f, .01f),
Product.of("China Garlic", 1.93f, 0.0f),
Product.of("Red Onion", 1.19f, 0.07f),
Product.of("Fuji Apple", 3.14f, .11f),
Product.of("Banana", 3.58f, .12f)
).forEach(stores::add);

assertEquals(Product.of("Banana", 3.58f, .12f), stores.top("discount"));
assertEquals(Product.of("Meji Fresh Milk 2L", 6.9f, 0.0f), stores.top("price"));


MapPriorityQueue is using Map and Priority queue to implement multi index feature.






Full working code is available @ github. There are other ways to implement such data structure by using Map + SortedSet , Map + SkipList


Saturday 15 July 2023

More choice using ChoiceFormat

The conversion of numbers to text is a frequently encountered problem for engineers. It manifests in various scenarios, and some of the most prevalent examples include:

  • Converting a number to a day of the week
  • Converting a number to a month
  • Transforming a status into user-friendly text

Now, let's explore some solutions for addressing this problem, using the conversion of a number to a day of the week as an explanatory example.







- IF/Else/Switch


public static String toDayOfWeek(int value) {

if (value == 1) {
return "MON";
} else if (value == 2) {
return "TUE";
} else if (value == 3) {
return "WED";
} else if (value == 4) {
return "THU";
} else if (value == 5) {
return "FRI";
} else if (value == 6) {
return "SAT";
} else if (value == 7) {
return "WED";
}
return "You are on moon";

}


While this solution may serve as a decent starting point, it is worth noting that even your grandfather might not approve of it.

- Maps


public static String toDayOfWeek(int value) {

Map<Integer, String> dayOfWeek = new HashMap<>() {
{
int index = 1;
put(index++, "MON");
put(index++, "TUE");
put(index++, "WED");
put(index++, "THU");
put(index++, "FRI");
put(index++, "SAT");
put(index++, "SUN");
}
};

return dayOfWeek
.getOrDefault(value, "You are on moon");

}


This solution appears to be an improvement and can be considered a good option, especially when the key is dynamic. However, it is important to note that maintaining this solution can become challenging over time.


- Enums


public enum DayOfWeek {
MONDAY,
TUESDAY,
WEDNESDAY,
THURSDAY,
FRIDAY,
SATURDAY,
SUNDAY;
private static final DayOfWeek[] ENUMS = DayOfWeek.values();

public static DayOfWeek of(int dayOfWeek) {
return ENUMS[dayOfWeek - 1];
}
}

This approach seems quite elegant and aligns with the way JDK handles similar scenarios. It offers several advantages, such as leveraging data types to ensure compile-time safety. This not only enhances maintainability but also provides the benefit of catching errors during compilation. However, one potential drawback of this approach is that it can pose challenges when it comes to extending functionality due to the strong type safety constraints.

- Choice Format


The choice format is a relatively new feature introduced in JDK 17+, and it offers intriguing solutions for tackling this problem. Before delving into the intricacies of how it works, let's examine some code examples to get a better understanding.

public static String toDayOfWeek(int value) {
double[] limits = {1, 2, 3, 4, 5, 6, 7};
String[] formats = {"Mon", "Tue", "Wed", "Thur", "Fri", "Sat", "Sun"};

ChoiceFormat form = new ChoiceFormat(limits, formats);
return form.format(value);
}

This solution solves the problem by employing a straightforward approach: using pairs of arrays or vectors to map numbers to their corresponding strings.

At first glance, it may not seem particularly remarkable, resembling a Map structure where the keys are represented by one array and the values by another. However, the true test begins when we pass values that are not present in the array.

Now, let's speculate. What do you think this function would return if we were to pass 0 or 100? One possibility could be "Undefined," which would hold true if this were JavaScript. Another option could be null or a NullPointerException, which is a close guess given your familiarity with Java. However, this is where the solution gets interesting.

Lets look at the output for 1/2/0/100

1 -> Mon 2 -> Tue 0 -> Mon 100 -> Sun


The output obtained from this solution provides some valuable insights into how ChoiceFormat operates. For example, when we pass 0, it returns "Monday," and when we pass 100, it returns "Sunday."

Based on these results, you might start getting an idea of how ChoiceFormat functions. It relies on a few key elements:

  • Ascending limits array: The limits array is arranged in ascending order, defining the intervals.
  • Format array: This array has the same size as the limits array and contains the corresponding text representations for each interval.
  • Interval Behaviour: This values in limit array represent half-open interval, meaning that the lower limit is inclusive while upper limit is exclusive.

These factors play a crucial role in determining the appropriate text representation based on the input value within the defined intervals.

Lets look at half-open interval match with below function


public static String weekDayOrWeekend(int value) {
double[] limits = {1, 6};
String[] formats = {"WeekDay", "Weekend"};

ChoiceFormat form = new ChoiceFormat(limits, formats);
return form.format(value);
}


This particular implementation exhibits an interesting behavior: for values less than 5, the function returns "Weekday," while for values 6 and above, it returns "Weekend."

Isn't it fascinating how ChoiceFormat manages to accomplish this range-based search and deliver the appropriate result? It's remarkable how this small utility class can perform such a useful trick.

Let's consider one more simple example before delving into the greater capabilities of this utility class.


public static String workDays(int value) {
double[] limits = {1, 2, 5, 6};
String[] formats = {"Monday Blues", "WorkHard", "It is Friday!!", "Relax"};

ChoiceFormat form = new ChoiceFormat(limits, formats);
return form.format(value);
}




This should give fair bit of idea that this class will be useful in many places.

Lets look at some application

- Better log messages

public static String files(int value) {
double[] limits = {0, 1, 2};
String[] formats = {"No files", "One files", "Many files"};

ChoiceFormat form = new ChoiceFormat(limits, formats);
return form.format(value);
}

 

System.out.println("Found " + files(100)); //Found Many files
System.out.println("Found " + files(0));//Found No files
System.out.println("Found " + files(1));//Found One files

- Conditional Log message 

public static String formatMessage(String format, int value) {
ChoiceFormat form = new ChoiceFormat(format);
return form.format(value);
}



String format = "0#no files | 1#one file |2# two files |3< more than 2 ";
System.out.println(formatMessage(format, 2)); //two files
System.out.println(formatMessage(format, 10)); //more than 2
System.out.println(formatMessage(format, 0)); //no files
System.out.println(formatMessage(format, 1)); //one file


This example showcases the power of advanced string interpolation by utilizing rules embedded within the format string.

By leveraging this technique, we can define rules directly within the format string itself, which provides a flexible and concise approach to handle various scenarios.

- Parameterised Conditional Log message 

Multiple ways to do this, lets look at few example.


public static ChoiceFormat usingPair() {
double[] priceLimits = {0.0, 10.0, 50.0, 100.0};
String[] priceFormats = {
"The item is not available",
"The item is on sale for {0}",
"The item is moderately priced at {0}",
"The item is expensive at {0}"
};
return new ChoiceFormat(priceLimits, priceFormats);
}


public static ChoiceFormat usingStringLiteral() {
return new ChoiceFormat(
"0#The item is not available |10#The item is on sale for {0} |50#The item is moderately priced at {0} |100#The item is expensive at {0}");
}


public static ChoiceFormat usingStringRules() {

String rules = String.join(" |",
"0#The item is not available",
"10#The item is on sale for {0}",
"50#The item is moderately priced at {0}",
"100#The item is expensive at {0}");

return new ChoiceFormat(rules);
}


ChoiceFormat can be created using any of the methods mentioned above, each with its own advantages and considerations. However, some methods may be easier to maintain and less error-prone than others.

Among the options, if I were to choose one, I would prefer using the last method demonstrated, which involves using string rules. This method provides greater flexibility and simplicity in defining the rules for the ChoiceFormat. By using string rules, you can easily specify the mappings between input values and their corresponding text representations in a concise and readable manner. This approach often results in code that is easier to understand, modify, and maintain.

Above format can be used as below



ChoiceFormat priceFormat = usingStringRules();

double price = 120;
Object[] formatArguments = {price};
String formattedPrice = MessageFormat.format(priceFormat.format(price), formatArguments);
System.out.println(formattedPrice); // The item is expensive at 120


Just imaging how this capability can be used by logging framework !


ChoiceFormat empowers to amplify the range of choices available for message formatting. By incorporating ChoiceFormat into your code, you can introduce a multitude of options, enriching the formatting possibilities.

The versatility of ChoiceFormat allows you to define and customize a wide array of choices, each with its own designated format. This flexibility enables you to create dynamic and adaptive messages that cater to different input values.

With ChoiceFormat at your disposal, you can enhance your message formatting capabilities, opening up new avenues for crafting comprehensive and adaptable output.