Transformers have become the de facto models of choice in machine learning, typically leading to impressive performance on many applications. At the same time, architectural development in the transformer world is mostly driven by empirical findings, and the theoretical understanding of the architectural building blocks is rather limited. In contrast, Dense Associative Memory models, also known as Modern Hopfield Networks, have a well-established theoretical foundation but have not yet demonstrated truly impressive practical results. We propose a transformer architecture that replaces the sequence of feedforward transformer blocks with a single large Associative Memory model. Our novel architecture, called Energy Transformer (or ET for short), has many of the familiar architectural primitives that are often used in the current generation of transformers. However, it is not identical to the existing architectures. The sequence of transformer layers in ET is purposely designed to minimize a specifically engineered energy function, which is responsible for representing the relationships between the tokens. As a consequence of this computational principle, the attention in ET is different from the conventional attention mechanism. In this work, we introduce the theoretical foundations of ET, explore its empirical capabilities using the image completion task, and obtain strong quantitative results on the graph anomaly detection task.
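To make the computational principle concrete, the following is a minimal sketch of tokens being updated by gradient descent on an energy function. It is not the ET energy itself: the pairwise log-sum-exp energy, the identity query/key projections, and the names `energy`, `energy_grad`, and `descend` are all assumptions chosen for illustration. The toy energy is not bounded below, so only a few small steps are taken.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def attention_weights(tokens, i, beta):
    """Softmax over beta * <x_i, x_j> for all j != i (token i acts as the query)."""
    scores = {j: beta * dot(tokens[i], tokens[j])
              for j in range(len(tokens)) if j != i}
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {j: math.exp(s - top) for j, s in scores.items()}
    norm = sum(exps.values())
    return {j: e / norm for j, e in exps.items()}


def energy(tokens, beta=1.0):
    """Toy energy (an assumption): -(1/beta) * sum_i log sum_{j != i} exp(beta * <x_i, x_j>)."""
    total = 0.0
    for i in range(len(tokens)):
        scores = [beta * dot(tokens[i], tokens[j])
                  for j in range(len(tokens)) if j != i]
        top = max(scores)
        total += top + math.log(sum(math.exp(s - top) for s in scores))
    return -total / beta


def energy_grad(tokens, beta=1.0):
    """Analytic gradient of the toy energy with respect to every token."""
    n, d = len(tokens), len(tokens[0])
    grad = [[0.0] * d for _ in range(n)]
    for i in range(n):
        for j, w in attention_weights(tokens, i, beta).items():
            for a in range(d):
                grad[i][a] -= w * tokens[j][a]  # contribution with token i as query
                grad[j][a] -= w * tokens[i][a]  # contribution with token j as key
    return grad


def descend(tokens, steps=10, step_size=0.02, beta=1.0):
    """Apply `steps` gradient-descent updates; each update plays the role of one layer."""
    history = [energy(tokens, beta)]
    for _ in range(steps):
        grad = energy_grad(tokens, beta)
        tokens = [[x - step_size * g for x, g in zip(xi, gi)]
                  for xi, gi in zip(tokens, grad)]
        history.append(energy(tokens, beta))
    return tokens, history


if __name__ == "__main__":
    start = [[1.0, 0.0], [0.8, 0.3], [-0.5, 0.9], [0.1, -0.7]]
    _, energies = descend(start)
    print("energy before:", energies[0])
    print("energy after: ", energies[-1])
```

In this sketch the attention-like weights are not designed separately; they appear as a by-product of differentiating the energy, which is one way to read the statement that attention in an energy-minimizing architecture differs from the conventional mechanism.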