I am trying to understand the utility of the transformer architecture.
It is clear to me that if we have a model and we add more parameters to optimize over, then in general it may become better suited to some data-related task. However, it seems the fundamental philosophy underlying deep learning is that adding parameters in a certain way is better than adding them in another way.
Why is adding new parameters via the transformer architecture better than simply taking a plain old feed-forward network and adding a bunch more parameters?
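To make the contrast I'm asking about concrete, here is a rough sketch (my own illustration, using PyTorch; the layer sizes are arbitrary) of the two ways of "adding parameters": widening a position-wise feed-forward block versus spending a comparable number of parameters on a self-attention layer.

```python
import torch
import torch.nn as nn

d_model, seq_len, batch = 64, 10, 2
x = torch.randn(batch, seq_len, d_model)  # a batch of token embeddings

# Option A: "just add more parameters" to a plain feed-forward block.
# Each position is transformed independently; widening the hidden layer
# adds parameters but never lets position i look at position j.
ffn = nn.Sequential(
    nn.Linear(d_model, 4 * d_model),
    nn.ReLU(),
    nn.Linear(4 * d_model, d_model),
)
out_ffn = ffn(x)  # shape: (batch, seq_len, d_model)

# Option B: add parameters as a self-attention layer (the transformer way).
# Roughly the same order of parameter count, but the output at each position
# is a weighted mix over all positions, with weights computed from the data.
attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=4, batch_first=True)
out_attn, weights = attn(x, x, x)  # out: (batch, seq_len, d_model)

print(out_ffn.shape, out_attn.shape, weights.shape)
```

So the question is really why Option B is a better place to spend the extra parameters than Option A.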