“Embeddings” get thrown around a lot these days, thanks to the Large Language Model (LLM) craze. It’s one of those words I’d just pretend to fully understand. But I wanted to go beyond nodding at the buzzword; I wanted to get it. So I sat down to explore the types of embeddings, the math behind them, and why they matter. This post is mostly me thinking out loud, with the hope that it helps future-me, and maybe you too.
Word embeddings are dense vector representations of words that encode semantic and syntactic relationships into a multidimensional vector space.
Woah, hold on, there are a lot of terms packed in one sentence. Let’s take that intimidating sentence apart, one piece at a time.
1. Dense Vector Representation
A vector is just a list of numbers. For example:
[0.12, -0.98, 0.34, 1.25, …]
Why numbers? Because computers can’t understand words the way we do, but they can work with numbers like pros. So we translate words into numbers that capture something about their meaning.
2. Semantic Relationships
“Semantic” basically means meaning. So if two words have similar meanings, their vectors will be close together in this number space.
For example, the vectors for “happy” and “joyful” end up close together in this space, while “happy” and “carburetor” land far apart.
3. Syntactic Relationships
That’s all about grammar: how words fit into sentences.
For example, “run”, “runs”, and “running” play similar grammatical roles, so their vectors pick up that pattern too.
So, embeddings can capture both meaning and grammar-ish similarities.
4. Multidimensional Vector Space
These vectors don’t just live in a normal 2D or 3D graph. They live in 100, 300, sometimes even 768 dimensions.
Can we picture that? Nope. But the math doesn’t care. The more dimensions, the more space we have to arrange words so that similar ones stay close together, and unrelated ones drift apart.
It’s pretty hard to imagine hundreds of dimensions, but we can project embeddings down to 2D or 3D to get an idea of what’s going on. Below are a few images that try to show what embeddings look like when we squish them into a lower-dimensional space.
Think of it like zooming out on a social network of words. Synonyms, related concepts, or words used in the same kinds of sentences tend to be in the same neighbourhood.
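If you want to poke at this yourself, here’s a minimal sketch using gensim’s pretrained GloVe vectors. This assumes you have gensim and scikit-learn installed and don’t mind downloading the ~130 MB “glove-wiki-gigaword-100” model; the printed neighbours and 2D coordinates are illustrative, not exact.

```python
import gensim.downloader as api
import numpy as np
from sklearn.decomposition import PCA

# Load 100-dimensional GloVe vectors through gensim's downloader
vectors = api.load("glove-wiki-gigaword-100")

# Which words live in the same neighbourhood as "teacher"?
print(vectors.most_similar("teacher", topn=5))
# typically includes words like "teachers", "student", "school", ...

# Squish a handful of 100-d vectors down to 2D so you could plot them
words = ["apple", "banana", "fruit", "laptop", "software", "technology"]
points_2d = PCA(n_components=2).fit_transform(np.array([vectors[w] for w in words]))
for word, (x, y) in zip(words, points_2d):
    print(f"{word:>10}: ({x:+.2f}, {y:+.2f})")
```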
Okay, but why does turning words into numbers work so well? Why do similar words magically end up close together?
Enter the Distributional Hypothesis.
“A word is characterised by the company it keeps” (paraphrasing the linguist J.R. Firth).
In other words:
If two words often show up in the same kinds of sentences, they’re probably related in meaning.
For example, take two sentences like “The student asked a question in class” and “The teacher asked a question in class.”
Since “student” and “teacher” show up in similar environments (surrounded by words like asked, question, and class), their vectors end up related. When we train word embeddings, we don’t explicitly tell the model what words mean. Instead, the model learns meaning by seeing which words appear near each other across tons of text.
So if two words share a neighbourhood in lots of different sentences, their vectors end up close together too.
That’s why this whole number-space trick works in the first place.
Semantic similarity refers to how close the meanings of two words are. To measure it, we need a way to compare two N-dimensional vectors, v and w. The most widely used metric for comparing two vectors (like two word embeddings) is cosine similarity, which is built on the dot product.
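For two N-dimensional vectors v and w, the dot product is just the sum of the element-wise products:

$$\mathbf{v} \cdot \mathbf{w} = \sum_{i=1}^{N} v_i\, w_i$$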
Let’s say we have two word vectors, v and w, and we take their dot product. The intuition (there’s a quick numpy check below) is that:
If two vectors point in a similar direction, their dot product will be high.
If they point in very different directions, the dot product will be closer to zero (or even negative).
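Here’s that quick check, a minimal sketch with made-up toy vectors rather than real embeddings:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])    # points the same way as a
c = np.array([-3.0, 0.5, -1.0])  # points somewhere else entirely

print(np.dot(a, b))  # 28.0 -> large and positive: similar direction
print(np.dot(a, c))  # -5.0 -> negative: very different direction
```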
The dot product is sensitive to the magnitude (length) of the vectors. That means:
Even if two vectors aren’t very similar, if one is really long, the dot product might still be large — just because of its size.
So, using the raw dot product gives more weight to longer vectors, which can mess with our interpretation of “closeness” or “similarity.” We need a metric that measures similarity regardless of vector length. To fix this, we use cosine similarity, which normalizes the vectors before taking the dot product.
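For two N-dimensional vectors v and w, that works out to:

$$\cos(\mathbf{v}, \mathbf{w}) = \frac{\mathbf{v} \cdot \mathbf{w}}{\lVert \mathbf{v} \rVert \, \lVert \mathbf{w} \rVert} = \frac{\sum_{i=1}^{N} v_i\, w_i}{\sqrt{\sum_{i=1}^{N} v_i^2}\, \sqrt{\sum_{i=1}^{N} w_i^2}}$$

And here’s a minimal numpy sketch (again with toy vectors, not real embeddings) showing why the normalization matters:

```python
import numpy as np

def cosine_similarity(v, w):
    """Dot product of v and w after dividing out each vector's length."""
    return np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))

v = np.array([1.0, 2.0, 3.0])
w = np.array([1.0, 2.0, 3.5])    # almost the same direction as v
u = np.array([100.0, 5.0, 5.0])  # a different direction, but much longer

print(np.dot(v, w), np.dot(v, u))                        # 15.5 vs 125.0 -- the long vector "wins"
print(cosine_similarity(v, w), cosine_similarity(v, u))  # ~0.997 vs ~0.33 -- length no longer matters
```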
Word embedding methods are typically categorised into frequency-based (count-based) and prediction-based approaches. We will discuss two count-based approaches here: co-occurrence counts and TF-IDF. The resulting matrices are highly sparse and extremely high-dimensional.
These are based on the simple idea:
Words that frequently appear in similar contexts tend to have similar meanings, an idea known as the distributional hypothesis (the same one from earlier).
So we build huge tables, called co-occurrence matrices, that count how often words appear together. They come in two flavours. In a word-document (term-document) matrix, each row represents a word, each column represents a document, and each cell counts how often that word appears in that document. In a word-word (term-term) matrix, each row and column represents a word, and each cell counts how often the two words appear together within a fixed window of context.
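As a toy sketch of the word-word flavour, here’s a tiny Python script that counts co-occurrences within a ±2-word window; the three sentences are made up purely for illustration:

```python
from collections import defaultdict

sentences = [
    "the student asked the teacher a question",
    "the teacher answered the student",
    "the student read a book",
]

window = 2  # look up to 2 positions away on either side
cooc = defaultdict(lambda: defaultdict(int))

for sentence in sentences:
    words = sentence.split()
    for i, word in enumerate(words):
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if i != j:
                cooc[word][words[j]] += 1

print(dict(cooc["student"]))  # which words keep "student" company
print(dict(cooc["teacher"]))  # ...and which keep "teacher" company
```

Even on three made-up sentences, “student” and “teacher” already keep similar company (“the”, “asked”, “answered”), which is exactly the row-pattern similarity described next.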
So, even from raw counts, we can already start to see which words “belong” together. Words with similar row patterns are likely semantically related.
From the table, the document Fruit Market Overview is represented by the first column vector, [10, 0, 7, 0, 0]. Similarly, the word apple is represented by its row vector, [10, 0, 0, 8]. Okay, it’s time to determine the semantic similarity between these vectors!
How will we find out whether fruit is more similar to apple or to technology?
The vector for apple is [10, 0, 0, 8], for fruit it’s [7, 5, 0, 0], and for technology it’s [0, 0, 15, 3]. Using the cosine similarity formula, we can calculate the similarity for each pair.
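Here’s the arithmetic as a quick numpy check (a minimal sketch; the vectors are just the raw counts from the table):

```python
import numpy as np

def cosine_similarity(v, w):
    return np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))

apple      = np.array([10, 0, 0, 8])
fruit      = np.array([7, 5, 0, 0])
technology = np.array([0, 0, 15, 3])

print(cosine_similarity(fruit, apple))       # 70 / (8.60 * 12.81) ~= 0.64
print(cosine_similarity(fruit, technology))  # 0  / (8.60 * 15.30)  = 0.0
```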
So, from the results, fruit shows a much higher semantic similarity to apple (about 0.64) than to technology (exactly 0, since their count vectors share no documents).