I know this may be a trivial question, but I'm confused about which metric to choose: whether it depends on the model itself, or whether there is a universally agreed-upon metric for computing semantic similarity.
Suppose I have a text-to-embedding model (e.g., Gemini, GPT, or any other) and I want to compute the similarity between two texts using the same model. The two texts yield two vectors (embeddings) of the same dimension.
Should I use cosine similarity, cosine distance, or Euclidean distance? I realize that in large language models (LLMs), angle-based measurements like cosine similarity are more commonly used than absolute distances like Euclidean distance.
However, I also realize that cosine similarity and cosine distance are linearly related (cosine distance is just one minus cosine similarity), so it doesn't really matter which of the two I choose; it's just a matter of convention.
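For concreteness, here is a minimal sketch of the three candidates, assuming the embeddings are already available as NumPy arrays (`u` and `v` below are placeholder values, not the output of any particular model):

```python
import numpy as np

# Placeholder embeddings; in practice both would come from the same embedding model.
u = np.array([0.1, 0.3, -0.2, 0.7])
v = np.array([0.2, 0.1, -0.1, 0.9])

cosine_similarity = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
cosine_distance = 1.0 - cosine_similarity      # always exactly 1 minus the similarity
euclidean_distance = np.linalg.norm(u - v)     # also depends on the vector magnitudes

print(cosine_similarity, cosine_distance, euclidean_distance)
```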
Both cosine similarity and cosine distance single out orthogonality as a distinguished value (similarity $0$, distance $1$). My question is whether orthogonality (i.e., a 90-degree angle) represents full dissimilarity, or whether the fully opposite 180-degree angle should be considered "fully dissimilar."
It's clear that when the angle between two vectors is 0 degrees, they point in the same direction and are maximally similar. But I'm wondering whether two vectors at an angle of 180 degrees should be considered fully dissimilar, as opposed to the 90-degree angle, at which the vectors are orthogonal.
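Concretely, $\cos 0^\circ = 1$, $\cos 90^\circ = 0$, and $\cos 180^\circ = -1$, so the question is whether "full dissimilarity" should correspond to a cosine similarity of $0$ or of $-1$.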
Again, please note that I'm not looking for the answer "it depends on the model you used," because model providers usually don't document this.
If orthogonality represents full dissimilarity, then I would use the absolute cosine similarity, so the range becomes $[0, 1]$ instead of $[-1, 1]$, or the squared cosine similarity, which has the same range but is smooth where the absolute value has a kink at 90 degrees.
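In formulas, with $\theta$ the angle between the two embeddings:
$$s_{\text{abs}} = \lvert\cos\theta\rvert \in [0, 1], \qquad s_{\text{sq}} = \cos^2\theta \in [0, 1],$$
both of which send $90^\circ$ to $0$ and treat $0^\circ$ and $180^\circ$ identically as $1$.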
But if 180 degrees represents full dissimilarity, then I would use the angular distance, where the angle range $[0, \pi]$ is rescaled to $[0, 1]$.
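In other words, with $\theta = \arccos(\text{cosine similarity}) \in [0, \pi]$, the similarity score would be $1 - \theta/\pi$, which sends $0^\circ$ to $1$, $90^\circ$ to $0.5$, and $180^\circ$ to $0$.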
The goal is the same either way: I want a score in $[0, 1]$ where $1$ represents full similarity and $0$ full dissimilarity.
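To make the two options concrete, here is a minimal sketch of both candidate mappings in plain NumPy (the function names are mine, not from any library):

```python
import numpy as np

def orthogonality_is_dissimilar(cos_sim: float) -> float:
    """Map [-1, 1] to [0, 1], treating a 90-degree angle as full dissimilarity."""
    return cos_sim ** 2            # or abs(cos_sim) for the non-smooth variant

def opposite_is_dissimilar(cos_sim: float) -> float:
    """Map [-1, 1] to [0, 1], treating a 180-degree angle as full dissimilarity."""
    theta = np.arccos(np.clip(cos_sim, -1.0, 1.0))   # angle in [0, pi]
    return 1.0 - theta / np.pi

for c in (1.0, 0.0, -1.0):         # angles of 0, 90, and 180 degrees
    print(c, orthogonality_is_dissimilar(c), opposite_is_dissimilar(c))
```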