
I am working on a modified version of the triplet loss function introduced with SBERT, where instead of the Euclidean distance we use the cosine similarity. The formula to minimize is max( (s_a · s_p) / (|s_a| |s_p|) - (s_a · s_n) / (|s_a| |s_n|) + e, 0 ), where s_a is the embedding of the anchor sentence (the context), s_p is the embedding of the positive sentence (the correct continuation), and s_n is the embedding of the negative sentence (the wrong continuation).
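In PyTorch the formula would look roughly like this (a minimal sketch; the function name, the batch averaging, and the default margin are placeholders of mine):

```python
import torch
import torch.nn.functional as F

def cosine_triplet_loss(s_a, s_p, s_n, e=0.5):
    """Triplet loss with cosine similarity in place of the Euclidean distance.

    s_a, s_p, s_n: (batch, dim) embeddings of anchor, positive and negative.
    e: the margin from the formula (0.5 is an arbitrary placeholder).
    """
    sim_ap = F.cosine_similarity(s_a, s_p, dim=-1)  # (s_a . s_p) / (|s_a| |s_p|)
    sim_an = F.cosine_similarity(s_a, s_n, dim=-1)  # (s_a . s_n) / (|s_a| |s_n|)
    return torch.clamp(sim_ap - sim_an + e, min=0).mean()
```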

I would like to check that the function I came up with makes sense from a theoretical point of view. Where should I look to check which properties a loss function should satisfy?

Motivation for the question: I'm getting my hands dirty with contrastive loss functions, and this is an easy variation I came up with.

albusdemens

1 Answer


A loss function is just a function with a minimum.

In machine learning, though, we also require the loss function to be differentiable (at least almost everywhere), otherwise there is no backpropagation and hence no weight updates. Moreover, basically every deep learning library relies on autograd, so if the function is not differentiable your code will simply crash when computing gradients.
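For instance, a quick autograd sanity check on the cosine variant from the question (a sketch; the shapes and margin value are made up):

```python
import torch
import torch.nn.functional as F

# Dummy embeddings standing in for model outputs; requires_grad mimics
# tensors produced by trainable parameters.
s_a = torch.randn(4, 128, requires_grad=True)
s_p = torch.randn(4, 128, requires_grad=True)
s_n = torch.randn(4, 128, requires_grad=True)

loss = torch.clamp(
    F.cosine_similarity(s_a, s_p, dim=-1)
    - F.cosine_similarity(s_a, s_n, dim=-1)
    + 0.5,  # margin: an arbitrary placeholder
    min=0,
).mean()

loss.backward()
print(s_a.grad.shape)  # a gradient exists, so backpropagation goes through
```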

A stronger, but not compulsory, condition is Lipschitz continuity, i.e. ensuring that the function does not change faster than some constant rate: |L(x) - L(y)| <= K |x - y| for some constant K. Intuitively, a loss function should output high values for big differences between predictions and targets and small values for small ones, otherwise the weight updates risk being too big (no convergence) or too small (easily getting stuck in local minima).

Regarding your loss, the only issue I see is that you're replacing a proper metric, i.e. the Euclidean distance, with a function that is not a metric, i.e. the cosine similarity (it does not satisfy the triangle inequality, hence it's not a metric). So I would be careful and test what kind of values you get with some dummy data, to understand whether it still behaves as a proper loss; see the sketch below.
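For example, with hand-built vectors whose geometry is obvious (a sketch; the dimensions and margin are arbitrary):

```python
import torch
import torch.nn.functional as F

def loss(s_a, s_p, s_n, e=0.5):
    # The cosine variant of the triplet loss from the question.
    return torch.clamp(
        F.cosine_similarity(s_a, s_p, dim=-1)
        - F.cosine_similarity(s_a, s_n, dim=-1)
        + e,
        min=0,
    ).mean()

anchor = torch.randn(1, 8)
aligned = anchor + 0.01 * torch.randn(1, 8)  # almost the same direction
opposed = -anchor                            # exactly the opposite direction

print(loss(anchor, aligned, opposed))  # positive close to anchor, negative far
print(loss(anchor, opposed, aligned))  # roles swapped
```

Comparing the two printed values tells you which configuration the loss actually rewards and which it penalizes, which is exactly the kind of behavior worth checking before training on real data.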

Edoardo Guerriero