
I know this is a trivial question, but I’m really confused about which metric to choose—whether it depends on the model itself, or if there is a universally agreed-upon metric for computing semantic similarity.

Suppose I have a text-to-embedding model (e.g., Gemini, GPT, or any other) and I want to compute the similarity between two texts with the same model. These texts will generate two vectors (embeddings) of the same dimension.

Should I use cosine similarity, cosine distance, or Euclidean distance? I realize that in large language models (LLMs), angle-based measurements like cosine similarity are more commonly used than absolute distances like Euclidean distance.

However, I also realize that cosine similarity and cosine distance have a linear relationship, so it doesn't really matter which one I choose—it’s just an encoding issue.

Both cosine similarity and cosine distance exhibit properties related to orthogonality. My question is whether orthogonality (i.e., a 90-degree angle) represents full dissimilarity, or if the fully opposite 180-degree angle should be considered "fully dissimilar."

It's clear that when the angle between two vectors is 0 degrees (cosine similarity of 1), they point in the same direction and are maximally similar. But I'm wondering whether two vectors at a 180-degree angle (cosine similarity of -1) are actually considered fully dissimilar, as opposed to the 90-degree angle, which represents orthogonality.

Again, please don't answer with "It depends on the model you used," because model providers usually don't publish this information.


If orthogonality represents full dissimilarity, then I would use the absolute value of the cosine similarity, so the range would be $[0, 1]$ instead of $[-1, 1]$, or the squared cosine similarity, which produces the same range but is smoother than the absolute value.

But if 180 degrees represents full dissimilarity, then I would use angular distance, where $[0, \pi]$ is mapped to $[0, 1]$.

The goal is the same: I want a mapping to $[0, 1]$ where 1 represents full similarity and 0 full dissimilarity.
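
For concreteness, here is a minimal sketch of the two candidate mappings (assuming NumPy; the vectors `a` and `b` below are placeholders for embeddings from your model):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between a and b, in [-1, 1]."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def orthogonality_score(a: np.ndarray, b: np.ndarray, squared: bool = False) -> float:
    """Option 1: orthogonality (90 degrees) = full dissimilarity.
    |cos| or cos^2 maps 90 degrees to 0 and both 0 and 180 degrees to 1."""
    c = cosine_similarity(a, b)
    return c * c if squared else abs(c)

def angular_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Option 2: 180 degrees = full dissimilarity.
    Maps the angle, which lies in [0, pi], linearly onto [1, 0]."""
    c = np.clip(cosine_similarity(a, b), -1.0, 1.0)  # guard against rounding error
    return 1.0 - float(np.arccos(c)) / np.pi

# Placeholder vectors; replace with two embeddings of equal dimension.
a = np.array([0.1, 0.3, -0.2])
b = np.array([0.2, 0.1, 0.4])
print(orthogonality_score(a, b), angular_similarity(a, b))
```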

1 Answer


The convention is to use cosine similarity or cosine distance. As you may know, an LLM's embedding space is very high-dimensional, and Euclidean-style distances between points tend to be less informative and less robust there, running into the curse of dimensionality. Also, orthogonality of embedding vectors is conceptually different from dissimilarity.

Cosine similarity ranges from -1 to 1. A cosine similarity of 1 means the vectors are perfectly aligned (no angle between them), indicating maximum similarity, whereas a value of -1 implies they are diametrically opposite, reflecting maximum dissimilarity. Values near zero indicate orthogonality.
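
To make the three regimes concrete, here is a toy sketch with hand-picked 2-D vectors (not real embeddings), which also shows that cosine distance is just 1 minus cosine similarity:

```python
import numpy as np

def cos_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 0.0])
print(cos_sim(a, np.array([2.0, 0.0])))   #  1.0 -> same direction, maximally similar
print(cos_sim(a, np.array([0.0, 3.0])))   #  0.0 -> orthogonal, "unrelated"
print(cos_sim(a, np.array([-1.0, 0.0])))  # -1.0 -> opposite direction, maximally dissimilar

# Cosine distance = 1 - cosine similarity, so it ranges over [0, 2].
print(1 - cos_sim(a, np.array([-1.0, 0.0])))  # 2.0
```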

Cosine distance excels in clustering tasks like K-means clustering, grouping semantically related documents even if they don't use the same words. It's ideal for high-dimensional, sparse data common in text processing. By capturing vector orientation, it enables more accurate thematic clustering. This technique is often used in unsupervised learning to discover natural groupings within document sets, such as grouping news articles by topics.
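
As a rough sketch of how this is often done in practice (assuming scikit-learn; the matrix `X` below is random noise standing in for real document embeddings): standard k-means uses Euclidean distance, but L2-normalizing the embeddings first makes Euclidean distance a monotonic function of cosine distance, so the clustering is effectively driven by vector orientation.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

# Stand-in for document embeddings from your model: 100 "documents", 768 dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 768))

# On the unit sphere, ||u - v||^2 = 2 * (1 - cos(u, v)),
# so k-means on L2-normalized vectors clusters by cosine distance.
X_unit = normalize(X)  # row-wise L2 normalization

labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X_unit)
print(labels[:10])
```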

Therefore, 180 degrees represents full dissimilarity, which arises between texts that are related (and thus not orthogonal) but opposed in meaning. For example, "Life" and "Pencil" are unrelated, so their embeddings are (roughly) orthogonal but not fully dissimilar, whereas "Life" and "Death" are fully dissimilar yet obviously related.
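
If you want to check such claims for your own model, here is a sketch assuming the sentence-transformers package and the all-MiniLM-L6-v2 model (both are assumptions; any embedding model would do, and the actual values depend on the model, so they may not match the idealized 0 / -1 picture):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model; substitute your own
life, pencil, death = model.encode(["Life", "Pencil", "Death"])

def cos_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print("Life vs Pencil:", cos_sim(life, pencil))  # unrelated pair
print("Life vs Death: ", cos_sim(life, death))   # related but opposed pair
```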
