
The paper Attention Is All You Need describes the Transformer architecture, which consists of an encoder and a decoder.

However, it isn't clear to me what cost function such an architecture minimizes.

Consider a translation task, for example, where, given an English sentence $x_{english} = [x_0, x_1, x_2, \dots, x_m]$, the transformer decodes it into a French sentence $x_{french}' = [x_0', x_1', \dots, x_n']$. Let's say the true label is $y_{french} = [y_0, y_1, \dots, y_p]$.

What is the objective function of the transformer? Is it the MSE between $x_{french}'$ and $y_{french}$? And does it have any weight regularization terms?

user3667125

1 Answer


I took a look at Tensor2Tensor's source code, and it seems the loss function is the cross-entropy between the predicted probability matrix of shape $\text{sentence length} \times \text{vocab size}$ (taken right before the argmax that picks the output token) and the true label, a vector of length $\text{sentence length}$ containing the target token IDs.
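For concreteness, here is a minimal TensorFlow sketch of that loss (this is not the actual Tensor2Tensor code; the tensor names and shapes are illustrative assumptions):

```python
import tensorflow as tf

# Assumed shapes (hypothetical example, not Tensor2Tensor's real variables):
#   logits:  [batch, sentence_length, vocab_size] -- decoder output before softmax/argmax
#   targets: [batch, sentence_length]             -- true token IDs (the French reference)
batch, sentence_length, vocab_size = 2, 7, 32000
logits = tf.random.normal([batch, sentence_length, vocab_size])
targets = tf.random.uniform([batch, sentence_length], maxval=vocab_size, dtype=tf.int32)

# Cross-entropy per position: softmax over the vocab dimension, then the
# negative log-likelihood of the true token ID at that position.
per_token_loss = tf.nn.sparse_softmax_cross_entropy_with_logits(
    labels=targets, logits=logits)  # shape [batch, sentence_length]

# Average over positions to get the scalar training loss.
loss = tf.reduce_mean(per_token_loss)
```

So the objective is a per-token classification loss (negative log-likelihood over the vocabulary), not an MSE between the generated and reference sentences; a real implementation would typically also mask out padding positions before averaging.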

user3667125