While Large Language Models and their underlying Transformer architecture are remarkably efficient, they do not reflect how our brain processes and learns a diversity of cognitive tasks such as language, nor how it leverages working memory. Furthermore, Transformers face a computational limitation: complexity that grows quadratically with sequence length. Motivated by these limitations, we aim to design architectures that leverage efficient working-memory dynamics to overcome standard computational barriers. We introduce Echo State Transformers (EST), a hybrid architecture that resolves this challenge while demonstrating state-of-the-art performance on classification and detection tasks. EST integrates the Transformer attention mechanism with nodes from Reservoir Computing to create a fixed-size memory system. Drawing inspiration from Echo State Networks, our approach leverages several reservoirs (random recurrent networks) in parallel as a lightweight and efficient working memory. These independent units possess distinct, learned internal dynamics with an adaptive leak rate, enabling them to dynamically adjust their own temporality. By applying attention over this fixed number of units instead of over the input tokens, EST achieves linear complexity over the whole sequence, effectively breaking the quadratic scaling problem of standard Transformers. We evaluate ESTs on a recent time-series benchmark, the Time Series Library, which comprises 69 tasks across five categories. Results show that EST ranks first overall in two of the five categories, outperforming strong state-of-the-art baselines on classification and anomaly-detection tasks, while remaining competitive on short-term forecasting. These results demonstrate that by shifting the attention mechanism from the entire input sequence to a fixed set of evolving memory units, it is possible to maintain high sensitivity to temporal events while achieving constant computational complexity per step.
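The core idea of the abstract can be made concrete with a minimal sketch: several leaky reservoir units evolve in parallel, and attention is computed over their fixed number of states rather than over the growing token sequence, so the per-step cost is constant. All names, sizes, and the fixed (rather than learned) leak rates below are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

class ReservoirUnit:
    """One reservoir (random recurrent network) with a leak rate.
    In EST the leak rate is adaptive (learned); here it is fixed for clarity."""
    def __init__(self, n_in, n_res, leak):
        self.W_in = rng.normal(0.0, 0.5, (n_res, n_in))
        W = rng.normal(0.0, 1.0, (n_res, n_res))
        # Rescale to spectral radius 0.9, a common Echo State Network heuristic.
        self.W = W * (0.9 / np.abs(np.linalg.eigvals(W)).max())
        self.leak = leak
        self.state = np.zeros(n_res)

    def step(self, x):
        # Standard leaky-integrator reservoir update.
        pre = np.tanh(self.W_in @ x + self.W @ self.state)
        self.state = (1.0 - self.leak) * self.state + self.leak * pre
        return self.state

def attend(query, unit_states):
    """Scaled dot-product attention over the K unit states.
    K is fixed, so this costs O(K) per step -> linear in sequence length."""
    S = np.stack(unit_states)                      # (K, n_res)
    scores = S @ query / np.sqrt(S.shape[1])       # (K,)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ S                                   # fixed-size context (n_res,)

# Three units with distinct temporalities (slow, medium, fast dynamics).
units = [ReservoirUnit(n_in=3, n_res=16, leak=l) for l in (0.1, 0.5, 0.9)]
for t in range(20):                                # toy input sequence
    x = rng.normal(size=3)
    states = [u.step(x) for u in units]
    ctx = attend(rng.normal(size=16), states)      # O(1) work per time step
```

Note how the attention keys are the three unit states regardless of how many time steps have elapsed; this is the shift, described in the abstract, from attending over the whole input sequence to attending over a fixed set of evolving memory units.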