
Encoders are commonly used for understanding sequences (like classification, NER, etc.), while decoders are used for generative tasks in an autoregressive manner (e.g., GPT-style text generation).

Decoders are autoregressive because they generate one token at a time, conditioning on previously generated tokens.

In contrast, encoders process the entire input sequence in a single pass and return contextualized representations of the tokens.

If I train the encoder part of a Transformer using a sliding window approach, where the model uses the output of the previous inference as input for the next step (similar to a feedback loop), wouldn't it behave in a way that's similar to a decoder's autoregressive behavior?

In this setup, I'm essentially giving the model the last predicted output as the next input, so wouldn't this iterative process have similar properties to autoregressive generation in a decoder?

For example, here are the (x_t, y_t) pairs I would use to train a Transformer encoder-only model at each timestep:

x1: [I, <NEXT>, <PAD>, <PAD>, <PAD>]
y1: [I, LOVE, <NEXT>, <PAD>, <PAD>]

x2: [I, LOVE, <NEXT>, <PAD>, <PAD>]
y2: [I, LOVE, KANADE, <NEXT>, <PAD>]

x3: [I, LOVE, KANADE, <NEXT>, <PAD>]
y3: [I, LOVE, KANADE, <EOT>, <PAD>]

x4: [I, LOVE, KANADE, <EOT>, <PAD>]
y4: [I, LOVE, KANADE, <EOT>, <PAD>]

x5 and onwards: same as x4
y5 and onwards: same as y4
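Here is a rough sketch of the feedback loop I have in mind. The `encoder_model` below is just a hard-coded stand-in that replays the example above (the names `generate`, `MAX_LEN`, and the special-token strings are placeholders, not real code I have):

```python
from typing import List

PAD, NEXT, EOT = "<PAD>", "<NEXT>", "<EOT>"
MAX_LEN = 5

def encoder_model(window: List[str]) -> List[str]:
    """Hard-coded stand-in for a trained encoder-only model.

    A real model would predict a token for every position in parallel;
    here we just replay the "I LOVE KANADE" example from above.
    """
    target = ["I", "LOVE", "KANADE", EOT]
    filled = sum(1 for t in window if t not in (PAD, NEXT))
    out = target[: filled + 1]
    if out[-1] != EOT:
        out = out + [NEXT]
    return out + [PAD] * (MAX_LEN - len(out))

def generate(prompt: List[str]) -> List[str]:
    """Feed the model its own output until it emits <EOT> (the x1..x4 windows)."""
    window = prompt + [NEXT] + [PAD] * (MAX_LEN - len(prompt) - 1)
    while EOT not in window:
        window = encoder_model(window)  # one full forward pass over the window per step
    return window

print(generate(["I"]))  # ['I', 'LOVE', 'KANADE', '<EOT>', '<PAD>']
```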

Both encoders and decoders use self-attention mechanisms. Given that self-attention is responsible for capturing dependencies between tokens, what are the key constraints or differences between the two?

Is the primary difference just that the decoder generates tokens one by one, while the encoder processes the whole sequence at once?
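For reference, this is roughly how I picture the shared self-attention computation; the `self_attention` helper below is only an illustration (no learned projections, single head), with the causal mask as the one structural difference:

```python
import torch
import torch.nn.functional as F

def self_attention(x: torch.Tensor, causal: bool) -> torch.Tensor:
    """x: (seq_len, d_model); Q = K = V = x for simplicity (no learned projections)."""
    d = x.size(-1)
    scores = x @ x.transpose(0, 1) / d ** 0.5               # (seq_len, seq_len)
    if causal:
        # decoder-style: position i may only attend to positions <= i
        future = torch.triu(torch.ones_like(scores), diagonal=1).bool()
        scores = scores.masked_fill(future, float("-inf"))
    return F.softmax(scores, dim=-1) @ x                    # same math in both cases

x = torch.randn(5, 8)
enc_out = self_attention(x, causal=False)  # encoder: every token sees the whole sequence
dec_out = self_attention(x, causal=True)   # decoder: each token sees only its prefix
```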

I find this situation reminiscent of the traditional autoencoder structure, where the encoder maps the input to a latent space and the decoder maps this latent space back into a human-readable representation. However, the Transformer encoder-decoder model is unintuitive to me.

1 Answer


You’re right that the encoder-decoder Transformer aligns with the traditional autoencoder (AE) structure, except that an AE’s encoder output is usually a compressed latent representation, while a Transformer’s encoder output is an enriched, contextualized representation.

While your sliding-window approach makes an encoder behave similarly to a decoder, it lacks causal constraints, in the sense that your encoder attends to all input tokens in parallel rather than strictly sequentially. This can introduce dependencies that violate the autoregressive constraint; for instance, in your window 2 above, the encoder can attend to the <NEXT> token when predicting the next token. Also, Transformer decoders are optimized for token-by-token autoregressive generation, while your sliding windows require reprocessing the overlapping inputs at every step, which can be computationally expensive.
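For example, here is a rough PyTorch sketch (the layer sizes and the batch are made up for illustration) showing that a stock encoder layer only becomes causal when you pass it a subsequent mask; without one, every position can attend to later positions such as the <NEXT> slot. Note also that there is no key/value cache here, so a sliding-window loop would recompute attention over the full window at every step.

```python
import torch
import torch.nn as nn

seq_len, d_model = 5, 32
layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
x = torch.randn(1, seq_len, d_model)                  # (batch, seq, features)

# Plain encoder: bidirectional, so position 2 can attend to positions 3 and 4
# (e.g. the <NEXT> slot), which is what breaks the autoregressive constraint.
bidirectional_out = layer(x)

# Same layer, decoder-like: an upper-triangular -inf mask blocks attention
# to future positions, restoring causality.
causal_mask = nn.Transformer.generate_square_subsequent_mask(seq_len)
causal_out = layer(x, src_mask=causal_mask)
```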

cinch