Avey-B

Compact pretrained bidirectional encoders remain the backbone of industrial NLP under tight compute and memory budgets. Their effectiveness stems from self-attention's ability to deliver high-quality bidirectional contextualization with sequence-level parallelism, as popularized by BERT-style architectures. Recently, Avey was introduced as an autoregressive, attention-free alternative that naturally admits an encoder-only adaptation. In this paper, we reformulate Avey for the encoder-only paradigm and propose several innovations to its architecture, including decoupled static and dynamic parameterizations, stability-oriented normalization, and neural compression. Results show that this reformulated architecture compares favorably to four widely used Transformer-based encoders, consistently outperforming them on standard token-classification and information-retrieval benchmarks while scaling more efficiently to long contexts.


