Attention-free language models that combine gating and convolutions are growing in popularity due to their efficiency and increasingly competitive performance. To better understand these architectures, we pretrain a suite of 17 attention and "gated-convolution" language models, finding that state-of-the-art gated-convolution architectures still underperform attention by up to 2.1 perplexity points on the Pile. In a fine-grained analysis, we find that 82% of the gap is explained by each model's ability to recall information previously mentioned in-context, e.g. "Hakuna Matata means no worries Hakuna Matata it means no" $\rightarrow$ "??". On this task, termed "associative recall" (AR), attention outperforms gated convolutions by a large margin: a 70M-parameter attention model outperforms a 1.4B-parameter gated-convolution model. This is surprising because prior work shows that gated convolutions can perfectly solve synthetic tests of AR capability. To close the gap between synthetics and real language, we develop a new formalization of the task, called multi-query associative recall (MQAR), that better reflects actual language. We perform an empirical and theoretical study of MQAR that elucidates differences in the parameter efficiency of attention and gated-convolution recall. Informed by our analysis, we evaluate simple convolution-attention hybrids and show that hybrids with input-dependent sparse attention patterns can close 97.4% of the gap to attention while maintaining sub-quadratic scaling. Our code is available at: https://github.com/HazyResearch/zoology.
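To make the recall task concrete, the following is a minimal sketch of what associative recall asks of a model, written as a plain lookup rather than a neural network. The abstract does not spell out the exact MQAR construction, so the layout used here (key-value pairs followed by several queries) and the names `recall_next`, `make_mqar_example`, and `recall_by_lookup` are illustrative assumptions, not the authors' implementation.

```python
import random


def recall_next(tokens):
    """Predict the next token by copying what followed the most recent
    earlier occurrence of the final token (single-query recall)."""
    last = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == last:
            return tokens[i + 1]
    return None  # the final token never appeared earlier in the context


def make_mqar_example(num_pairs, num_queries, vocab_size, seed=0):
    """Build one toy multi-query example (assumed layout, not the paper's):
    a context of interleaved key/value tokens, then several queried keys."""
    rng = random.Random(seed)
    half = vocab_size // 2
    keys = rng.sample(range(0, half), num_pairs)  # distinct key tokens
    values = [rng.randrange(half, vocab_size) for _ in keys]
    context = [tok for pair in zip(keys, values) for tok in pair]
    queried = rng.sample(keys, num_queries)  # keys to look up again
    lookup = dict(zip(keys, values))
    targets = [lookup[k] for k in queried]
    return context, queried, targets


def recall_by_lookup(context, queries):
    """Reference solution: read the context as key/value pairs and
    answer every query by exact lookup."""
    table = {context[i]: context[i + 1] for i in range(0, len(context) - 1, 2)}
    return [table[q] for q in queries]


if __name__ == "__main__":
    sentence = "Hakuna Matata means no worries Hakuna Matata it means no"
    print(recall_next(sentence.split()))

    context, queried, targets = make_mqar_example(4, 2, vocab_size=32)
    print(recall_by_lookup(context, queried) == targets)
```

The single-query function mirrors the "Hakuna Matata" example from the abstract, while the multi-query generator illustrates the idea that several recalls may be required within one sequence. A trained model would have to learn this lookup from data, whereas the reference solution here simply hard-codes it.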