
The basic seq2seq model consists of two parts: a recurrent encoder that compresses the input sequence into a vector, and a recurrent decoder that unrolls that vector into the output sequence:

[Figure: encoder-decoder diagram; the decoder's outputs w, x, y, z are fed back in as its inputs]

Why is the output, w, x, y, z, of the decoder used as its input? Shouldn't the hidden state of the RNN from the previous timesteps be enough?


3 Answers


In seq2seq, they model the joint distribution of a character/word sequence by decomposing it into time-forward conditionals:

\begin{align*}
p(w_1, w_2, \dots, w_n) &= p(w_1)\, p(w_2 \mid w_1) \cdots p(w_n \mid w_1, \dots, w_{n-1}) \\
&= p(w_1) \prod_{i=2}^{n} p(w_i \mid w_{<i})
\end{align*}

This joint distribution can be sampled by drawing from each of the conditionals in time order, and that is exactly what the decoder imitates: you want the second output to depend on the sampled first output, not on its distribution.

This is why the hidden state alone is NOT enough for this setup: it is a latent representation of the distribution, not a sample from it.
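To make this concrete, here is a minimal sketch of sampling-based decoding. It is not from the answer: the framework (PyTorch), layer sizes, and start-token id are all assumptions, chosen only to show the feedback of each sampled token as the next input.

```python
import torch
import torch.nn as nn

# Hypothetical sizes; the answer does not specify an architecture.
vocab_size, emb_dim, hidden_dim = 1000, 64, 128

embed = nn.Embedding(vocab_size, emb_dim)
cell = nn.GRUCell(emb_dim, hidden_dim)        # one decoder step
to_logits = nn.Linear(hidden_dim, vocab_size)

def sample_sequence(h, start_token=0, max_len=20):
    """Sample w_1, w_2, ... by feeding each sampled token back in, so
    step i is conditioned on the actual samples w_{<i}, not only on the
    hidden state's summary of their distributions."""
    token = torch.tensor([start_token])
    outputs = []
    for _ in range(max_len):
        h = cell(embed(token), h)                        # update hidden state
        probs = torch.softmax(to_logits(h), dim=-1)      # p(w_i | w_{<i})
        token = torch.multinomial(probs, 1).squeeze(1)   # draw a sample
        outputs.append(token.item())
    return outputs

# In a full seq2seq model, h would come from the encoder's final state.
h0 = torch.zeros(1, hidden_dim)
print(sample_sequence(h0))
```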

Note: In training, they use the ground truth as input by default. This works under the assumption that the model should have predicted the correct word; if it didn't, the gradient of the word/character-level loss will reflect that (this is called teacher forcing and has a multitude of pitfalls).
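A rough sketch of what teacher forcing looks like in training, under the same assumed (hypothetical) PyTorch setup as above: the decoder is fed the true previous token at every step, regardless of what it actually predicted.

```python
import torch
import torch.nn as nn

vocab_size, emb_dim, hidden_dim = 1000, 64, 128   # hypothetical sizes
embed = nn.Embedding(vocab_size, emb_dim)
cell = nn.GRUCell(emb_dim, hidden_dim)
to_logits = nn.Linear(hidden_dim, vocab_size)
loss_fn = nn.CrossEntropyLoss()

def teacher_forcing_loss(h, target):
    """target: 1-D tensor of ground-truth token ids for one sequence.
    At step i the decoder receives the true token target[i-1] as input,
    not the token it would have sampled itself."""
    loss = 0.0
    prev = torch.tensor([0])                 # assumed start-of-sequence id
    for i in range(len(target)):
        h = cell(embed(prev), h)
        logits = to_logits(h)
        loss = loss + loss_fn(logits, target[i:i + 1])
        prev = target[i:i + 1]               # feed ground truth, not the sample
    return loss / len(target)

h0 = torch.zeros(1, hidden_dim)
print(teacher_forcing_loss(h0, torch.tensor([5, 17, 42])))
```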


Shouldn't the hidden state of the RNN from the previous timestamps be enough?

It is theoretically enough to generate a sequence. However, allowing an input offers a couple of convenient extras:

  • Training data for output sequences is used twice: once as input (the previous items of the sequence) and once as target (to compute the loss). This may help the training process, because the decoder trains both as a decoder for the new sequence type and as a predictive model over the output sequence, semi-independently: the weights from the input to the RNN layer are affected by error gradients separately from the weights between the previous hidden state and the next state. The two sets of weights still jointly influence the output and the next state, so they are not fully independent over a sequence.

  • By allowing input of the sequence generated so far, the decoder can work as a generator, where the next item in the sequence does not need to be the maximum-probability item, but can be sampled or have other rules applied. This allows approaches such as beam search, commonly used in machine translation, which maintains several potential outputs and selects the best one at the end (see the sketch below).

I have not done the experiment, but I suspect the first item results in faster and better generalisation. The second one is very convenient for natural language generation and similar problems.
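To illustrate the second point, here is a small beam-search sketch over such a decoder. The PyTorch modules, sizes, and start-token id are assumptions for illustration only; the key point is that each hypothesis feeds its own chosen token back in, which is only possible because the decoder accepts an input.

```python
import torch
import torch.nn as nn

vocab_size, emb_dim, hidden_dim = 1000, 64, 128   # hypothetical sizes
embed = nn.Embedding(vocab_size, emb_dim)
cell = nn.GRUCell(emb_dim, hidden_dim)
to_logits = nn.Linear(hidden_dim, vocab_size)

def beam_search(h0, beam_width=3, max_len=10, start_token=0):
    """Keep the beam_width highest-scoring partial sequences; each step
    extends every candidate with its most promising next tokens."""
    # Each hypothesis: (log_prob, token_ids, hidden_state)
    beams = [(0.0, [start_token], h0)]
    for _ in range(max_len):
        candidates = []
        for logp, tokens, h in beams:
            inp = torch.tensor([tokens[-1]])            # feed back the chosen token
            h_new = cell(embed(inp), h)
            log_probs = torch.log_softmax(to_logits(h_new), dim=-1).squeeze(0)
            top_lp, top_ids = log_probs.topk(beam_width)
            for lp, tok in zip(top_lp.tolist(), top_ids.tolist()):
                candidates.append((logp + lp, tokens + [tok], h_new))
        # Keep only the best beam_width hypotheses.
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
    return beams[0][1]   # token ids of the best hypothesis

h0 = torch.zeros(1, hidden_dim)
print(beam_search(h0))
```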


In the original seq2seq paper, they used two RNNs, one for encoding and one for decoding. In the encoder, the inputs need to be unrolled to capture the time dependencies. Now, if we want to pass the hidden state from the encoder to the decoder, the decoder's hidden state shape needs to match the encoder's (i.e. the same architecture). Since the architecture is the same, we cannot directly generate a sequence of n samples within the decoder without unrolling it, and we cannot unroll it without an input.
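A minimal sketch of this hand-off, assuming a PyTorch GRU encoder/decoder with made-up sizes and a hypothetical start-token id: the encoder's final hidden state initialises the decoder, which then has to be unrolled step by step, and each step needs an input token.

```python
import torch
import torch.nn as nn

vocab_size, emb_dim, hidden_dim = 1000, 64, 128   # hypothetical sizes
enc_embed = nn.Embedding(vocab_size, emb_dim)
dec_embed = nn.Embedding(vocab_size, emb_dim)
encoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
decoder_cell = nn.GRUCell(emb_dim, hidden_dim)    # same hidden size as the encoder

to_logits = nn.Linear(hidden_dim, vocab_size)

src = torch.tensor([[4, 8, 15, 16]])              # one source sequence
_, h_enc = encoder(enc_embed(src))                # h_enc: (1, batch, hidden_dim)
h = h_enc.squeeze(0)                              # (batch, hidden_dim)

# The decoder shares the hidden-state shape with the encoder, so the only way
# to produce n output steps is to unroll it n times, and every unroll step
# needs an input token - here the previously generated one.
token = torch.tensor([0])                         # assumed start-of-sequence id
outputs = []
for _ in range(5):
    h = decoder_cell(dec_embed(token), h)
    token = to_logits(h).argmax(dim=-1)           # greedy choice, for brevity
    outputs.append(token.item())
print(outputs)
```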
