In edge-computing or real-time IoT regression applications, transformers' quadratic attention complexity and large parameter counts make them impractical; other DL models such as LSTMs, with their much smaller memory footprint, are clearly more advantageous there.
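To make the footprint contrast concrete, here is a minimal PyTorch sketch, with purely illustrative dimensions (none taken from any cited work), comparing the parameter counts of a small transformer encoder and an LSTM of the same hidden size; the quadratic attention cost is noted in the comments.

```python
# Minimal sketch (not a benchmark): contrast the footprint of a small
# transformer encoder with an LSTM of comparable hidden size. All dimensions
# below are illustrative assumptions.
import torch
import torch.nn as nn

seq_len, d_model, n_heads, n_layers = 256, 128, 4, 2

transformer = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads, batch_first=True),
    num_layers=n_layers,
)
lstm = nn.LSTM(input_size=d_model, hidden_size=d_model,
               num_layers=n_layers, batch_first=True)

count = lambda m: sum(p.numel() for p in m.parameters())
print(f"transformer params: {count(transformer):,}")  # much larger per layer
print(f"lstm params:        {count(lstm):,}")

# Self-attention also materializes a (seq_len x seq_len) score matrix per
# head, so its time/memory cost grows quadratically with the input window,
# while the LSTM's per-step cost is independent of seq_len.
```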
Furthermore, even for sequential or time-series data where real-time efficiency is not a concern, an autoregressive transformer is not, as you claimed, the more natural choice in all cases. Transformers are comparatively good at capturing long-range dependencies, but this advantage does not extend to every kind of time series; chaotic series are a counterexample. See the recent paper by Valle & Bruno (2025), "Forecasting chaotic time series: Comparative performance of LSTM-based and Transformer-based neural networks".
Their abstract reads: "The complexity and sensitivity to initial conditions are the main characteristics of chaotic dynamical systems, making long-term forecasting a significant challenge. Deep learning, however, is a powerful technique that can potentially improve forecasting in chaotic time series. In this study, we explored the performance of modern neural network architectures in forecasting chaotic time series with different Lyapunov exponents. To accomplish this, we created a robust dataset composed of chaotic orbits with Lyapunov exponents ranging from 0.019 to 1.253 and used state-of-the-art neural network models for time series forecasting, including recurrent-based and transformer-based architectures. Our results show that LSTNet presents the best results in one-step-ahead and the recursive one-step-ahead forecasting for the majority of the time series in our dataset, enabling the prediction of chaotic time series with high Lyapunov exponent. Additionally, we observed that the sensitivity to initial conditions and complexity still affects the performance of the neural networks, decaying predictive power in time series with larger Lyapunov exponent."
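For intuition on why chaos caps the forecasting horizon regardless of architecture, here is a small self-contained sketch (mine, not from the paper) using the logistic map at r = 4, whose Lyapunov exponent ln 2 ≈ 0.693 falls inside the 0.019–1.253 range studied above: two orbits separated by 1e-9 diverge to order one within roughly 30 iterations.

```python
# Illustration: chaos bounds any recursive one-step-ahead forecaster.
# The logistic map x -> r*x*(1-x) at r = 4.0 is chaotic with Lyapunov
# exponent ln 2 ~ 0.693.
import math

r = 4.0
f = lambda x: r * x * (1.0 - x)

# Estimate the Lyapunov exponent as the long-run average of the local
# log stretching rate log|f'(x)| = log|r*(1 - 2x)|.
x, lyap, n = 0.4, 0.0, 10_000
for _ in range(n):
    lyap += math.log(abs(r * (1.0 - 2.0 * x)))
    x = f(x)
print(f"Lyapunov exponent ~ {lyap / n:.3f}  (theory: ln 2 = {math.log(2):.3f})")

# Two orbits 1e-9 apart: the separation grows like exp(lambda * t), so it
# reaches order one after about ln(1e9)/ln(2) ~ 30 iterations, no matter
# how accurate each single-step prediction is.
a, b = 0.4, 0.4 + 1e-9
for t in range(60):
    a, b = f(a), f(b)
print(f"separation after 60 steps: {abs(a - b):.3e}")
```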
In summary, transformers struggle with multi-step forecasting of chaotic time series with high Lyapunov exponents because errors accumulate across recursive predictions, whereas LSTNet, which combines convolutional and recurrent components, effectively captures both the short-term patterns and the long-term dependencies inherent in such series.
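To make that architectural point concrete, below is a heavily simplified sketch of the LSTNet design (Lai et al., 2018): a 1-D convolution extracts short-term local patterns, a GRU models longer-range dependencies, and a linear autoregressive shortcut keeps the output sensitive to scale. The recurrent-skip component of the real model is omitted, and all hyperparameters are illustrative assumptions.

```python
# Simplified LSTNet-style model: conv (short-term) + GRU (long-term)
# + linear AR shortcut. Not the authors' implementation.
import torch
import torch.nn as nn

class TinyLSTNet(nn.Module):
    def __init__(self, n_series: int, window: int, conv_channels: int = 32,
                 kernel: int = 6, rnn_hidden: int = 64):
        super().__init__()
        self.conv = nn.Conv1d(n_series, conv_channels, kernel_size=kernel)
        self.gru = nn.GRU(conv_channels, rnn_hidden, batch_first=True)
        self.fc = nn.Linear(rnn_hidden, n_series)
        self.ar = nn.Linear(window, 1)  # per-series linear AR shortcut

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, window, n_series) -> one-step-ahead forecast (batch, n_series)
        c = torch.relu(self.conv(x.transpose(1, 2)))      # (batch, ch, window-kernel+1)
        _, h = self.gru(c.transpose(1, 2))                # h: (1, batch, rnn_hidden)
        nonlinear = self.fc(h[-1])                        # (batch, n_series)
        linear = self.ar(x.transpose(1, 2)).squeeze(-1)   # (batch, n_series)
        return nonlinear + linear

model = TinyLSTNet(n_series=1, window=64)
y_hat = model(torch.randn(8, 64, 1))
print(y_hat.shape)  # torch.Size([8, 1])
```

The AR shortcut is the design choice worth noting: the nonlinear part models the chaotic structure, while the linear path anchors the forecast to the recent raw values, which helps when the recursive predictions start to drift.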