
I'm trying to understand a few details about the NT-Xent loss defined in the SimCLR paper (link). The loss is defined as

$$\ell_{i,j} = -\log\frac{\exp(\mathrm{sim}(z_i,z_j)/\tau)}{\sum_{k=1}^{2N}\mathbb{1}_{[k\neq i]} \exp(\mathrm{sim}(z_i,z_k)/\tau)}$$

where $z_i$ and $z_j$ are the two augmented views of the same image. What I don't understand is the denominator: I see that we exclude the point $z_i$ via the indicator function, but shouldn't we also exclude $z_j$? Otherwise we will have $k=j$ for some $k$. Essentially, why do we keep the positive sample in the denominator?
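For concreteness, this is how I read the formula in code — a minimal NumPy sketch, assuming `z` is a hypothetical `(2N, d)` array of L2-normalized embeddings (so `sim` reduces to a dot product) and `tau` is the temperature; the function name is just a placeholder:

```python
import numpy as np

def nt_xent_pair(z, i, j, tau=0.5):
    """Literal evaluation of l_{i,j} from the formula above."""
    sims = z @ z[i] / tau          # sim(z_i, z_k) / tau for every k = 1..2N
    exps = np.exp(sims)
    exps[i] = 0.0                  # indicator 1_{[k != i]}: only k = i is dropped
    # note: k = j is *not* dropped, so the positive term stays in the denominator
    return -np.log(np.exp(sims[j]) / exps.sum())
```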

James Arten

2 Answers


To train the model, you take $N$ samples and build $N$ pairs by applying two different augmentations to each, so the total number of samples is $2N$. Now you take one pair $(i,j)$ and consider it to be positive. The remaining $N-1$ pairs (so $2(N-1)=2N-2$ samples) are negatives.

For each positive pair $(i,j)$ you compute the similarity of the positive pair in the numerator, and normalize it by the sum of similarities between the anchor $i$ and the negatives $k$: if you reason on pairs, instead of single samples, you have $k\neq j$. Maybe the paper is not so clear about it, but $i$ and $j$ are positive because they are two different views of the same input sample (say $x$), whereas $(i, k)$ and $(k, j)$ are negatives because $k$ is a view of another input (say $x'$, with $x\neq x'$). So, when computing the denominator you do not include $j$. I found a blog post that explains that too.

Lastly, the loss is also computed on the $(j,i)$ pairs, so you sum all the terms and divide by $2N$, i.e. two times the batch size.
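To make the batch computation concrete, here is a rough NumPy sketch that implements the formula from the question literally (only $k=i$ is masked). It assumes a stacking convention of my own: the $2N$ embeddings are stored as the first views of $x_1,\dots,x_N$ followed by the second views, so the positive partner of row $i$ is row $(i+N) \bmod 2N$:

```python
import numpy as np

def nt_xent_batch(z, tau=0.5):
    """z: (2N, d) embeddings, rows 0..N-1 are view 1, rows N..2N-1 are view 2."""
    two_n = z.shape[0]
    n = two_n // 2
    z = z / np.linalg.norm(z, axis=1, keepdims=True)     # cosine similarity via dot products
    sims = z @ z.T / tau
    np.fill_diagonal(sims, -np.inf)                      # mask k = i (exp(-inf) = 0)
    pos = (np.arange(two_n) + n) % two_n                 # index of each anchor's positive view
    log_denom = np.log(np.exp(sims).sum(axis=1))
    losses = -(sims[np.arange(two_n), pos] - log_denom)  # l_{i,j} for every anchor i
    return losses.mean()                                 # sum over (i,j) and (j,i), divided by 2N
```

The final `mean()` over the $2N$ anchors is exactly the sum over all $(i,j)$ and $(j,i)$ terms divided by two times the batch size.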

Luca Anzalone

If you look at this part: $$\frac{\exp(\mathrm{sim}(z_i,z_j)/\tau)}{\sum_{k=1}^{2N}\mathbb{1}_{[k\neq i]} \exp(\mathrm{sim}(z_i,z_k)/\tau)}$$ you will notice that it's similar to: $$\frac{\exp(x_i)}{\sum_{j=1}^{N}\exp(x_j)}$$ which is basically a softmax function. It's essentially a way to normalize the similarities so that the resulting probabilities sum up to 1. In the SimCLR case, you only have to exclude $k=i$, because of course the similarity of a sample with itself will always be 1.
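To see the softmax connection in code, here is a small sketch (the names and the assumption that `z` holds `(2N, d)` unit-norm embeddings are mine): after masking $k=i$, the fraction is a softmax over the remaining similarities, and the loss is the cross-entropy with the positive index $j$ as the target.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())       # numerically stable softmax
    return e / e.sum()

def loss_as_cross_entropy(z, i, j, tau=0.5):
    sims = z @ z[i] / tau
    sims[i] = -np.inf             # exclude k = i, whose similarity is trivially maximal
    probs = softmax(sims)         # probabilities over the other 2N - 1 samples sum to 1
    return -np.log(probs[j])      # cross-entropy with the positive as the "correct class"
```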

Essentially, this normalization is the reason why the positive samples (similarity close to 1.0) get separated from the negative samples (similarity close to 0).

Doch88