We introduce an Outlier-Efficient Modern Hopfield Model (termed $\mathrm{OutEffHop}$) and use it to address the outlier inefficiency problem of training gigantic transformer-based models. Our main contribution is a novel associative memory model facilitating \textit{outlier-efficient} associative memory retrievals. Interestingly, this memory model manifests a model-based interpretation of an outlier-efficient attention mechanism (${\rm Softmax}_1$): it is an approximation of the memory retrieval process of $\mathrm{OutEffHop}$. Methodologically, this allows us to introduce novel outlier-efficient Hopfield layers as powerful alternatives to traditional attention mechanisms, with superior post-quantization performance. Theoretically, the Outlier-Efficient Modern Hopfield Model retains and improves the desirable properties of standard modern Hopfield models, including fixed point convergence and exponential storage capacity. Empirically, we demonstrate the efficacy of the proposed model across large-scale transformer-based and Hopfield-based models (including BERT, OPT, ViT, and STanHop-Net), benchmarking against state-of-the-art methods like $\mathtt{Clipped\_Softmax}$ and $\mathtt{Gated\_Attention}$. Notably, $\mathrm{OutEffHop}$ achieves an average reduction of 22+\% in average kurtosis and 26+\% in the maximum infinity norm of model outputs across four models. Code is available at \href{https://github.com/MAGICS-LAB/OutEffHop}{GitHub}; models are on \href{https://huggingface.co/collections/magicslabnu/outeffhop-6610fcede8d2cda23009a98f}{Hugging Face Hub}; future updates are on \href{https://arxiv.org/abs/2404.03828}{arXiv}.
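The ${\rm Softmax}_1$ mechanism referenced above can be sketched as follows: relative to standard softmax, the denominator carries an extra $+1$ term (an implicit zero logit), so attention scores may sum to less than one and a head can effectively attend to nothing rather than being forced to spread probability mass onto irrelevant tokens. This is a minimal illustrative sketch in NumPy, not the paper's reference implementation; the function name and stabilization details are our own.

```python
import numpy as np

def softmax_1(x):
    """Illustrative Softmax_1: exp(x_i) / (1 + sum_j exp(x_j)).

    The extra +1 in the denominator acts as an implicit zero logit,
    letting the output sum to less than 1 (a head can attend to
    "nothing" instead of producing outlier no-op updates).
    """
    # Shift by max(x, 0) for numerical stability; the implicit zero
    # logit must be shifted consistently, becoming exp(-m).
    m = np.maximum(np.max(x, axis=-1, keepdims=True), 0.0)
    e = np.exp(x - m)
    return e / (np.exp(-m) + e.sum(axis=-1, keepdims=True))
```

For strongly peaked logits this behaves like standard softmax, while uniformly low logits yield near-zero total attention, which is the outlier-damping behavior the abstract describes.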