May 5, 2023 • Engineering/ML
Understanding and processing natural language is a complex task. Embeddings are a powerful tool in today's ML/AI discussions for addressing this problem.
In my mental model, embeddings encode words into meaning. An embedding is a point in an n-dimensional vector space where every dimension is assigned a specific meaning in the object space.
I can't really imagine n-dimensional vectors, but for an embedding system with two dimensions, I like to visualize it like this:
terrain (y-axis)
|
|
|   o (mountain bike)
|
|       o (road bike)
|
|
|           o (horse)
|
|                   o (car)
|
|
+------------------------------- speed (x-axis)
In this example, each point represents a mode of transportation. Items with similar speed and terrain capabilities are placed closer together. For example, a mountain bike and a road bike sit close to each other because they travel at similar speeds and are both used on a range of terrains, while a car is further away since it operates at higher speeds and mostly on paved roads.
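To make "closer together" concrete, here is a minimal sketch that measures straight-line (Euclidean) distance between the points. The coordinates are made up by me to roughly mirror the picture; they are not taken from any real embedding model.

```python
import math

# Hypothetical (speed, terrain) coordinates, chosen only to mirror the sketch.
vehicles = {
    "mountain bike": (2.0, 9.0),
    "road bike": (3.0, 7.0),
    "horse": (4.0, 4.0),
    "car": (8.0, 2.0),
}


def distance(a, b):
    """Euclidean distance between two points of the same dimension."""
    return math.dist(a, b)


def nearest(name):
    """Return the other vehicle that lies closest to `name` in the toy space."""
    others = (v for v in vehicles if v != name)
    return min(others, key=lambda v: distance(vehicles[name], vehicles[v]))


print(nearest("mountain bike"))
print(round(distance(vehicles["mountain bike"], vehicles["car"]), 2))
```

The same idea carries over to n dimensions: `math.dist` accepts points of any length, as long as both have the same number of coordinates.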
There are two superpowers that come from being able to map natural language onto an n-dimensional vector space:
There are also some downsides, though:
Because related vectors can be found quickly, embeddings are often used to solve search problems. In the current AI discourse, embeddings are used to quickly and efficiently find relevant context to put into the prompt of a large language model (LLM).
Imagine an AI system that answers questions about your code documentation. Ahead of time, we can create an embedding for every paragraph in the documentation and store them in a vector database: a database that holds the embedding vectors for us and has an index to query them efficiently, like the pgvector extension for Postgres.
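The indexing step could look roughly like the sketch below. Everything here is an assumption for illustration: `embed` is a toy stand-in that counts words from a tiny hand-picked vocabulary (a real system would call an embedding model), and the plain Python list plays the role of the vector database.

```python
import math
import re
from collections import Counter

# Hypothetical fixed vocabulary; a real embedding model learns its dimensions.
VOCAB = ["install", "config", "error", "deploy", "test"]


def embed(text):
    """Toy stand-in for an embedding model: normalized counts of VOCAB words."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    vec = [float(counts[word]) for word in VOCAB]
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


def build_index(paragraphs):
    """Embed every paragraph once, ahead of time, and keep (text, vector) pairs."""
    return [(p, embed(p)) for p in paragraphs]


docs = [
    "Run the install script, then edit the config file.",
    "If deploy fails with an error, check the config.",
]
index = build_index(docs)
for text, vector in index:
    print([round(x, 2) for x in vector], text)
```

The point of doing this ahead of time is that embedding the documentation is the slow part, so at question time only the question itself needs to be embedded.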
Now, when the user enters a question in natural language into our AI system, we can embed the question and find the most relevant paragraphs in our documentation. We usually want to pick as many as we can fit into the prompt of the LLM. The LLM now has the relevant parts of your documentation in front of it when answering the question. Fascinating!
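Here is a minimal sketch of that retrieval step, under a few assumptions of mine: similarity is cosine similarity, the prompt budget is counted in characters (real LLM limits are counted in tokens), and the hand-written 2-d vectors stand in for real embeddings. The `build_prompt` helper and the prompt layout are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_prompt(question, query_vec, index, budget_chars):
    """Add the most similar paragraphs first until the character budget is used up."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    context, used = [], 0
    for text, _ in ranked:
        if used + len(text) > budget_chars:
            break
        context.append(text)
        used += len(text)
    return "Context:\n" + "\n".join(context) + "\n\nQuestion: " + question


# Hand-written 2-d vectors stand in for real embeddings.
index = [
    ("Install with the setup script.", [1.0, 0.0]),
    ("Deploy errors are logged to stderr.", [0.0, 1.0]),
    ("Installation needs Python 3.", [0.9, 0.1]),
]
prompt = build_prompt("How do I install it?", [1.0, 0.1], index, budget_chars=60)
print(prompt)
```

With this budget, the two install-related paragraphs fit and the unrelated deploy paragraph is left out. A vector database does the ranking part with an index instead of sorting every entry.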
One important detail in the paragraphs above is that we have to split the documentation into meaningful chunks. Since we can only measure similarity at the granularity of this chunk size, you can easily see how important it is to pick a proper chunking scheme.
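One possible chunking scheme, sketched below, is a fixed-size window of words with a small overlap so that a sentence cut at a boundary still appears whole in one of the chunks. The overlap idea and the `chunk_words` helper are my own illustration, not the only (or necessarily the best) way to do it; splitting on paragraphs or headings is just as valid.

```python
def chunk_words(text, size, overlap):
    """Split text into chunks of `size` words; neighbors share `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


text = "one two three four five six seven eight nine ten"
chunks = chunk_words(text, size=4, overlap=1)
for chunk in chunks:
    print(chunk)
```

Smaller chunks give more precise matches but carry less context each; larger chunks do the opposite and use up the prompt budget faster.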