
I am trying to understand the utility of the transformer architecture.

It is clear to me that if we have a model and we add more parameters to optimize over, then in general it may be better suited to some data-related task. However, it seems the fundamental philosophy underlying deep learning is that adding parameters in a certain way is better than adding them in another way.

Why is adding new parameters via the transformer architecture better than simply taking a plain old feed-forward network and adding a bunch more parameters?

2 Answers


I'd say that the utility is practicality.

In theory, all these architectures are universal function approximators, i.e. they can represent or approximate more or less the same functions (see e.g. this or this for more details), but, over the years, people have observed that some are easier to train empirically or are more practically suited for some tasks than others.

One good example is the long short-term memory (LSTM), a more sophisticated recurrent neural network (RNN), which tries to circumvent practical problems that arise with vanilla RNNs, specifically the vanishing gradient problem.

I can't remember all the details of the transformer right now, but the utility is, again, probably just practicality.

nbro

I think one thing the deep learning community has learned time and time again is that parameter sharing across "locations" in the input data is important (because it increases the total amount of relevant data seen by each parameter). The first example of this is the CNN, but the same idea was used for RNNs/LSTMs, then even more so for transformers, as well as graph neural networks. The idea is that, for a CNN for example, you don't want to have to re-learn how to detect a dog from scratch at every location in an image when it is really the same task (doing so would divide the amount of available data by the number of positions). The same idea applies to handling similar token/node patterns (e.g. identifying a phrase or a hub node) at different locations in a text or graph, respectively.
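
Here's a rough sketch of what I mean (PyTorch, with made-up layer sizes, just to illustrate the parameter counts): a small convolution kernel is reused at every spatial position, whereas a fully-connected layer producing the same output shape needs its own weights for every position.

```python
# A rough sketch (PyTorch, illustrative sizes) of the parameter-sharing point:
# a Conv2d reuses one small kernel at every spatial location, while a Linear
# layer acting on the flattened image needs separate weights per position.
import torch
import torch.nn as nn

image = torch.randn(1, 3, 64, 64)           # batch of one 3-channel 64x64 image

conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
fc = nn.Linear(3 * 64 * 64, 16 * 64 * 64)   # "one weight per position" baseline

print(sum(p.numel() for p in conv.parameters()))  # 448 parameters, shared everywhere
print(sum(p.numel() for p in fc.parameters()))    # ~805 million parameters
print(conv(image).shape)                          # torch.Size([1, 16, 64, 64])
```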

So for transformers, the reason they work better than feed-forward NNs is that they have this property of sharing (or "reusing") the point-wise neural network weights across every location (i.e. word/token). But the reason they are better than, say, CNNs is that they ALSO have the attention mechanism to fetch relevant non-local context for the point-wise neural networks to see. This is intuitive, since text is complicated and can reference things said in entirely different parts of the text (e.g. every pronoun implicitly refers to the last place the corresponding noun was defined, and my comparison to CNNs refers back to the connection I drew in the first paragraph).
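
As a minimal sketch of these two properties (again PyTorch, with illustrative sizes, and with layer norm, multiple heads, etc. omitted for brevity): the attention step lets each token pull in context from anywhere in the sequence, and the same point-wise feed-forward weights are then applied at every position.

```python
# A minimal sketch (PyTorch, illustrative sizes) of the two properties above:
# the attention step lets each token gather context from every other token,
# and the same point-wise feed-forward weights are reused at every position.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyTransformerBlock(nn.Module):
    def __init__(self, d_model=32):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        # One point-wise MLP, applied identically to every token.
        self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model),
                                nn.ReLU(),
                                nn.Linear(4 * d_model, d_model))

    def forward(self, x):                        # x: (batch, seq_len, d_model)
        q, k, v = self.q(x), self.k(x), self.v(x)
        scores = q @ k.transpose(-2, -1) / x.shape[-1] ** 0.5
        context = F.softmax(scores, dim=-1) @ v  # non-local context per token
        return self.ff(x + context)              # shared weights at each position

block = TinyTransformerBlock()
print(block(torch.randn(1, 5, 32)).shape)   # works for a 5-token sequence...
print(block(torch.randn(1, 17, 32)).shape)  # ...and, unchanged, for 17 tokens
```

Note that nothing in the block depends on the sequence length, so the same weights handle inputs of different lengths, which relates to the P.S. below.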

P.S. You would be hard-pressed to use a fully-connected network for text at all, since the number of words in a sentence changes, but a feed-forward network takes a fixed number of inputs.

profPlum