
I have some questions about using transformer models (encoder-only, decoder-only, or encoder-decoder), including language transformers and Vision Transformers.

The overall form of a transformer consists of an encoder and a decoder. Depending on the model, you may use only the encoder, only the decoder, or both. However, for what purpose do model designers use only encoders, only decoders, or both?

I already knew that the encoder in a transformer takes in a sequence of input data and generates a fixed-length representation of it. This representation can then be fed into the decoder to generate an output sequence. In other words, the encoder can be thought of as a kind of compression that extracts features from the data, and the decoder can be thought of as restoring the compressed information from the encoder to its original form. So I'm wondering why some models work without having both an encoder and a decoder.

A few days ago, I came to think that encoder-only models are useful for classification, because Dosovitskiy et al., "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" (arXiv:2010.11929, 2020), uses only an encoder to classify images. Decoders seem useful for generative tasks, because Wang et al., "GIT: A Generative Image-to-text Transformer for Vision and Language" (arXiv:2205.14100, 2022), uses an encoder to encode the visual information from the input image into a representation that the text decoder can use to generate text. Then, to generate text, they feed the encoder's output together with the text as the decoder's input.

But I am sure my thinking is incomplete because of BERT and GPT. BERT uses an encoder and has no decoder; GPT uses a decoder and has no encoder. A typical user thinks that BERT and GPT both answer the questions they are asked, so they assume BERT and GPT provide the same service. In terms of model structure, however, BERT and GPT are completely different.

So, I have two questions about each functional part that makes up the transformer.

  1. What do the encoder and the decoder do in a transformer? The transformer referred to here can be for text or images.
  2. For what purpose do model designers use only encoders, only decoders, or both encoders and decoders?

Thank you.

Yang

1 Answer


The original transformer paper presents the transformer as a model consisting of both encoder and decoder. However, many times you will see (or hear) people describing their model as a "transformer model", but it actually consists only of an encoder or only of a decoder. That's fine as long as you know what exactly is going on. (I would say that the distinctive feature of the transformer model is the attention layer.)

For example, as you mentioned, Generative Pretrained Transformer (GPT) is a decoder-only model, and Vision Transformer (ViT) is an encoder-only model.

I feel like your question already contains the answer :) Yes, encoders are used exactly for that - encoding (or compressing) the input. Note that the output sequence from the encoder has the same length as the input sequence. This is why encoder-only models cannot be used for tasks where the output sequence has a different length or unknown length, e.g. machine translation, text summarization. In these cases you need the decoder.
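To make the length-preserving behaviour concrete, here is a toy single-head self-attention in NumPy (a purely illustrative stand-in for an encoder layer, with the learned weight matrices omitted): whatever sequence length you feed in, the same length comes out.

```python
import numpy as np

def self_attention(x):
    """Minimal single-head self-attention: a toy stand-in for a
    transformer encoder layer (learned projections omitted)."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)                       # (seq_len, seq_len)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # softmax over keys
    return weights @ x                                  # (seq_len, d)

x = np.random.randn(7, 16)   # a sequence of 7 tokens, 16-dim embeddings
out = self_attention(x)
print(out.shape)             # (7, 16): same sequence length as the input
```

This is why the encoder alone cannot produce a translation whose length differs from the input.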

Encoder-only models like ViT and BERT will encode the input sequence and produce an output sequence of the same length. Then for classification they will use only the first element of the output sequence.
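A minimal sketch of that classification step (the dimensions and the linear head here are made-up illustrations, not the actual ViT/BERT weights):

```python
import numpy as np

rng = np.random.default_rng(0)
# Pretend encoder output: e.g. ViT-Base on 196 patches plus 1 [CLS] token.
encoder_output = rng.standard_normal((197, 768))

# Hypothetical classification head: only the FIRST element of the
# output sequence (the [CLS] token) is fed to the classifier.
num_classes = 10
W = rng.standard_normal((768, num_classes))
b = np.zeros(num_classes)

cls_token = encoder_output[0]        # first element of the output sequence
logits = cls_token @ W + b           # (num_classes,)
pred = int(np.argmax(logits))
print(logits.shape, pred)
```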

Vision Transformer

As I said, encoder-only models cannot be used for generating an output sequence of different (or unknown) length, i.e. BERT cannot be used for machine translation or generative question answering. BERT can only be used for question answering on SQuAD-type datasets, i.e. your answer is a continuous segment from the input and the model only outputs <START_IDX> and <END_IDX> to mark it.
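A sketch of that SQuAD-style span extraction, assuming two hypothetical linear heads that score each position as the answer start and end (the answer is always a contiguous slice of the input, never freely generated text):

```python
import numpy as np

rng = np.random.default_rng(1)
tokens = ["the", "sky", "is", "blue", "today"]
hidden = rng.standard_normal((len(tokens), 32))   # encoder output, one vector per token

# Two hypothetical linear heads score each position as answer start / end.
w_start = rng.standard_normal(32)
w_end = rng.standard_normal(32)
start_idx = int(np.argmax(hidden @ w_start))
# Restrict the end search to positions at or after the start, so end >= start.
end_idx = start_idx + int(np.argmax(hidden[start_idx:] @ w_end))

answer = tokens[start_idx : end_idx + 1]   # a contiguous segment of the input
print(start_idx, end_idx, answer)
```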

For generation tasks you need a decoder. The decoder is actually an auto-regressive model: it generates the elements of the output sequence one by one until it decides that the sequence is ready, and then it emits the final <END> token.
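The decoding loop looks roughly like this; `decoder_step` is a hypothetical stand-in for a real decoder's next-token prediction:

```python
# A sketch of the auto-regressive decoding loop. `decoder_step` is a
# hypothetical toy model that emits tokens 1, 2, 3, then <END> (0).
END = 0

def decoder_step(sequence):
    return len(sequence) + 1 if len(sequence) < 3 else END

def generate(max_len=10):
    sequence = []
    while len(sequence) < max_len:
        token = decoder_step(sequence)
        if token == END:            # the model decides the sequence is ready
            break
        sequence.append(token)      # its own output is fed back as input
    return sequence

print(generate())   # [1, 2, 3]
```

Note that the output length is decided by the model at inference time, which is exactly what an encoder alone cannot do.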

Previously people used an encoder-decoder architecture to solve these sequence-to-sequence tasks (e.g. T5). You want the encoder to encode your input sequence and the decoder to decode it and produce the output sequence. However, it turns out that you can use a decoder-only model. You simply concatenate your source and target sentences and treat the task as a language modelling task. Now there are pros and cons to this approach, but the main takeaway is that you can do it if you want to.
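A sketch of that concatenation trick, with illustrative token names and a made-up `<SEP>` separator (real models use their own special tokens):

```python
# Framing translation as language modelling for a decoder-only model.
# Token names and the <SEP> separator are illustrative assumptions.
source = ["I", "like", "cats"]
target = ["Ich", "mag", "Katzen"]

# Training example: source and target concatenated into one sequence.
sequence = source + ["<SEP>"] + target + ["<END>"]

# The decoder-only model is trained to predict each next token;
# the loss is usually computed only on the target portion.
pairs = [(sequence[:i], sequence[i]) for i in range(len(source) + 1, len(sequence))]
for context, next_token in pairs:
    print(context, "->", next_token)
```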

You can check out these papers to read more about encoder-decoder vs decoder-only models for sequence-to-sequence tasks:
https://arxiv.org/pdf/2304.04052.pdf
https://aclanthology.org/2022.aacl-main.43.pdf

pi-tau