In edge-computing or real-time IoT regression applications, transformers' quadratic attention complexity and large parameter counts make them impractical; other DL models such as LSTMs, with their much smaller memory footprint, are clearly more advantageous there.
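To make the footprint contrast concrete, here is a minimal PyTorch sketch, with purely illustrative dimensions (none taken from any cited work), comparing the parameter counts of a small transformer encoder and an LSTM of the same hidden size; the quadratic attention cost is noted in the comments.

```python
# Minimal sketch (not a benchmark): contrast the footprint of a small
# transformer encoder with an LSTM of comparable hidden size. All dimensions
# below are illustrative assumptions.
import torch
import torch.nn as nn

seq_len, d_model, n_heads, n_layers = 256, 128, 4, 2

transformer = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads, batch_first=True),
    num_layers=n_layers,
)
lstm = nn.LSTM(input_size=d_model, hidden_size=d_model,
               num_layers=n_layers, batch_first=True)

count = lambda m: sum(p.numel() for p in m.parameters())
print(f"transformer params: {count(transformer):,}")  # much larger per layer
print(f"lstm params:        {count(lstm):,}")

# Self-attention also materializes a (seq_len x seq_len) score matrix per
# head, so its time/memory cost grows quadratically with the input window,
# while the LSTM's per-step cost is independent of seq_len.
```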
Furthermore, even for sequential or time-series data where real-time efficiency is not a concern, an autoregressive transformer is not, as you claimed, the more natural choice in all cases. Transformers are comparatively good at capturing long-range dependencies, but this advantage does not extend to every kind of time series; chaotic series are a counterexample. See the recent paper by Valle & Bruno (2025), "Forecasting chaotic time series: Comparative performance of LSTM-based and Transformer-based neural networks".
Their abstract reads: "The complexity and sensitivity to initial conditions are the main characteristics of chaotic dynamical systems, making long-term forecasting a significant challenge. Deep learning, however, is a powerful technique that can potentially improve forecasting in chaotic time series. In this study, we explored the performance of modern neural network architectures in forecasting chaotic time series with different Lyapunov exponents. To accomplish this, we created a robust dataset composed of chaotic orbits with Lyapunov exponents ranging from 0.019 to 1.253 and used state-of-the-art neural network models for time series forecasting, including recurrent-based and transformer-based architectures. Our results show that LSTNet presents the best results in one-step-ahead and the recursive one-step-ahead forecasting for the majority of the time series in our dataset, enabling the prediction of chaotic time series with high Lyapunov exponent. Additionally, we observed that the sensitivity to initial conditions and complexity still affects the performance of the neural networks, decaying predictive power in time series with larger Lyapunov exponent."
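For intuition on why chaos caps the forecasting horizon regardless of architecture, here is a small self-contained sketch (mine, not from the paper) using the logistic map at r = 4, whose Lyapunov exponent ln 2 ≈ 0.693 falls inside the 0.019–1.253 range studied above: two orbits separated by 1e-9 diverge to order one within roughly 30 iterations.

```python
# Illustration: chaos bounds any recursive one-step-ahead forecaster.
# The logistic map x -> r*x*(1-x) at r = 4.0 is chaotic with Lyapunov
# exponent ln 2 ~ 0.693.
import math

r = 4.0
f = lambda x: r * x * (1.0 - x)

# Estimate the Lyapunov exponent as the long-run average of the local
# log stretching rate log|f'(x)| = log|r*(1 - 2x)|.
x, lyap, n = 0.4, 0.0, 10_000
for _ in range(n):
    lyap += math.log(abs(r * (1.0 - 2.0 * x)))
    x = f(x)
print(f"Lyapunov exponent ~ {lyap / n:.3f}  (theory: ln 2 = {math.log(2):.3f})")

# Two orbits 1e-9 apart: the separation grows like exp(lambda * t), so it
# reaches order one after about ln(1e9)/ln(2) ~ 30 iterations, no matter
# how accurate each single-step prediction is.
a, b = 0.4, 0.4 + 1e-9
for t in range(60):
    a, b = f(a), f(b)
print(f"separation after 60 steps: {abs(a - b):.3e}")
```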
In summary, transformers struggle with multi-step forecasting of chaotic time series with high Lyapunov exponents because errors accumulate across recursive predictions, whereas LSTNet, which combines convolutional and recurrent components, effectively captures both the short-term patterns and the long-term dependencies inherent in such series.
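To make that architectural point concrete, below is a heavily simplified sketch of the LSTNet design (Lai et al., 2018): a 1-D convolution extracts short-term local patterns, a GRU models longer-range dependencies, and a linear autoregressive shortcut keeps the output sensitive to scale. The recurrent-skip component of the real model is omitted, and all hyperparameters are illustrative assumptions.

```python
# Simplified LSTNet-style model: conv (short-term) + GRU (long-term)
# + linear AR shortcut. Not the authors' implementation.
import torch
import torch.nn as nn

class TinyLSTNet(nn.Module):
    def __init__(self, n_series: int, window: int, conv_channels: int = 32,
                 kernel: int = 6, rnn_hidden: int = 64):
        super().__init__()
        self.conv = nn.Conv1d(n_series, conv_channels, kernel_size=kernel)
        self.gru = nn.GRU(conv_channels, rnn_hidden, batch_first=True)
        self.fc = nn.Linear(rnn_hidden, n_series)
        self.ar = nn.Linear(window, 1)  # per-series linear AR shortcut

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, window, n_series) -> one-step-ahead forecast (batch, n_series)
        c = torch.relu(self.conv(x.transpose(1, 2)))      # (batch, ch, window-kernel+1)
        _, h = self.gru(c.transpose(1, 2))                # h: (1, batch, rnn_hidden)
        nonlinear = self.fc(h[-1])                        # (batch, n_series)
        linear = self.ar(x.transpose(1, 2)).squeeze(-1)   # (batch, n_series)
        return nonlinear + linear

model = TinyLSTNet(n_series=1, window=64)
y_hat = model(torch.randn(8, 64, 1))
print(y_hat.shape)  # torch.Size([8, 1])
```

The AR shortcut is the design choice worth noting: the nonlinear part models the chaotic structure, while the linear path anchors the forecast to the recent raw values, which helps when the recursive predictions start to drift.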