
The paper Attention Is All You Need describes the Transformer architecture, which consists of an encoder and a decoder.

However, it isn't clear to me what cost function such an architecture minimizes.

Consider a translation task, for example, where, given an English sentence $x_{english} = [x_0, x_1, x_2, \dots, x_m]$, the transformer decodes it into a French sentence $x_{french}' = [x_0', x_1', \dots, x_n']$. Let's say the true label is $y_{french} = [y_0, y_1, \dots, y_p]$.

What is the objective function of the transformer? Is it the MSE between $x_{french}'$ and $y_{french}$? And does it have any weight regularization terms?

user3667125

1 Answer


I took a look at Tensor2Tensor's source code, and it seems the loss function is the cross-entropy between the predicted probability matrix of shape $\text{sentence length} \times \text{vocab size}$ (taken right before the argmax that picks the output token) and the true label, a vector of length $\text{sentence length}$ containing the target token IDs.
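For concreteness, here is a minimal TensorFlow sketch of that loss (this is not the actual Tensor2Tensor code; the tensor names and shapes are illustrative assumptions):

```python
import tensorflow as tf

# Assumed shapes (hypothetical example, not Tensor2Tensor's real variables):
#   logits:  [batch, sentence_length, vocab_size] -- decoder output before softmax/argmax
#   targets: [batch, sentence_length]             -- true token IDs (the French reference)
batch, sentence_length, vocab_size = 2, 7, 32000
logits = tf.random.normal([batch, sentence_length, vocab_size])
targets = tf.random.uniform([batch, sentence_length], maxval=vocab_size, dtype=tf.int32)

# Cross-entropy per position: softmax over the vocab dimension, then the
# negative log-likelihood of the true token ID at that position.
per_token_loss = tf.nn.sparse_softmax_cross_entropy_with_logits(
    labels=targets, logits=logits)  # shape [batch, sentence_length]

# Average over positions to get the scalar training loss.
loss = tf.reduce_mean(per_token_loss)
```

So the objective is a per-token classification loss (negative log-likelihood over the vocabulary), not an MSE between the generated and reference sentences; a real implementation would typically also mask out padding positions before averaging.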

user3667125