Structured dilated attention offers an appealing inference-time efficiency knob: it reduces attention FLOPs and KV-cache size by a factor of the dilation size D while preserving long-range connectivity. However, we find a persistent failure mode of such methods: sparsifying a pretrained attention model to a dilated pattern causes severe accuracy degradation. We introduce RAT+, a dense-pretraining architecture that augments attention with full-sequence recurrence and active recurrence learning. A single RAT+ model is pretrained densely once and can then be switched flexibly at inference time to dilated attention (optionally with local windows) or to hybrid layer/head compositions, requiring only a short 1B-token resolution adaptation rather than retraining separate sparse models. At 1.5B parameters trained on 100B tokens, RAT+ closely matches dense accuracy at dilation 16 and drops by about 2-3 points at dilation 64 on commonsense reasoning and LongBench tasks. Moreover, RAT+ outperforms standard attention when sparsified to top-k block attention. We further scale to 2.6B parameters and 200B tokens and observe the same trend.
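To make the efficiency claim concrete, the following is a minimal single-head sketch of dilated attention in NumPy (not the paper's implementation; the function name and stride-based key selection are illustrative assumptions): each query attends only to every D-th key/value position, so the attended KV set, and hence score FLOPs and cache size, shrink by a factor of D.

```python
import numpy as np

def dilated_attention(q, k, v, d=4):
    """Hypothetical single-head dilated attention sketch.

    Each query attends only to every d-th key/value position, so the
    effective KV set (and thus attention FLOPs and KV-cache size) is
    reduced by a factor of d while long-range positions remain reachable.
    """
    k_d, v_d = k[::d], v[::d]                       # strided KV selection: T/d entries
    scores = q @ k_d.T / np.sqrt(q.shape[-1])       # (T, T/d) scaled dot-product logits
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the sparse key set
    return weights @ v_d                            # (T, dim) outputs

rng = np.random.default_rng(0)
T, dim, D = 64, 16, 4
q, k, v = rng.normal(size=(3, T, dim))
out = dilated_attention(q, k, v, d=D)
assert out.shape == (T, dim)
# Each query attends to T // D = 16 keys instead of all 64.
```

Under this pattern, a model pretrained with dense attention sees a very different score distribution at inference time, which is consistent with the degradation described above when sparsification is applied post hoc.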