Recent advances in deep learning have relied heavily on the use of large Transformers due to their ability to learn at scale. However, the core building block of Transformers, the attention operator, exhibits quadratic cost in sequence length, limiting the amount of accessible context. Existing subquadratic methods based on low-rank and sparse approximations need to be combined with dense attention layers to match Transformers, indicating a gap in capability. In this work, we propose Hyena, a subquadratic drop-in replacement for attention constructed by interleaving implicitly parametrized long convolutions and data-controlled gating. In recall and reasoning tasks on sequences of thousands to hundreds of thousands of tokens, Hyena improves accuracy by more than 50 points over operators relying on state spaces and other implicit and explicit methods, matching attention-based models. We set a new state of the art for dense-attention-free architectures on language modeling on standard datasets (WikiText103 and The Pile), reaching Transformer quality with a 20% reduction in required training compute at sequence length 2K. Hyena operators are twice as fast as highly optimized attention at sequence length 8K, and 100x faster at sequence length 64K.
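To make the phrase "interleaving implicitly parametrized long convolutions and data-controlled gating" more concrete, the following is a minimal NumPy sketch of an operator in that spirit. It is not the authors' implementation: the function names (`fft_causal_conv`, `implicit_filter`, `hyena_like`), the sine-activated filter network, the exponential decay window, and the random, untrained weights are all assumptions chosen for illustration. The sketch only shows how a filter as long as the sequence can be generated from positions rather than stored explicitly, how it can be applied causally in O(L log L) time with the FFT, and how elementwise gating by projections of the input alternates with those convolutions.

```python
import numpy as np

def fft_causal_conv(u, h):
    """Causal convolution of u with filter h (both shape (L, D)) in O(L log L)."""
    L = u.shape[0]
    n = 2 * L  # zero-pad so the circular FFT convolution equals a linear one
    U = np.fft.rfft(u, n=n, axis=0)
    H = np.fft.rfft(h, n=n, axis=0)
    return np.fft.irfft(U * H, n=n, axis=0)[:L]

def implicit_filter(L, D, rng, hidden=16, decay=0.02):
    """Hypothetical implicit parametrization: a tiny network maps each
    position t to D filter values, so the parameter count is independent of L."""
    t = np.linspace(0.0, 1.0, L)[:, None]               # positions, shape (L, 1)
    W1 = rng.normal(size=(1, hidden))
    W2 = rng.normal(size=(hidden, D)) / np.sqrt(hidden)
    h = np.sin(10.0 * t @ W1) @ W2                       # shape (L, D)
    window = np.exp(-decay * np.arange(L))[:, None]      # assumed decay window
    return h * window

def hyena_like(u, rng, order=2):
    """Alternate long convolutions with data-controlled (elementwise) gating.
    Weights are random here; in a real model they would be learned."""
    L, D = u.shape
    # order + 1 linear projections of the input: one value branch, `order` gates
    projs = [u @ (rng.normal(size=(D, D)) / np.sqrt(D)) for _ in range(order + 1)]
    z, gates = projs[0], projs[1:]
    for x in gates:
        h = implicit_filter(L, D, rng)
        z = x * fft_causal_conv(z, h)  # gate depends on the input itself
    return z

rng = np.random.default_rng(0)
u = rng.normal(size=(128, 8))          # sequence length 128, width 8
y = hyena_like(u, np.random.default_rng(1))
print(y.shape)
```

Every step is either pointwise in time or an FFT convolution, so the cost grows as O(L log L) in sequence length L, in contrast to the O(L^2) cost of dense attention mentioned above.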