Long-context capability and computational efficiency are among the central challenges facing today's large language models. Existing efficient attention methods reduce computational complexity, but they typically suffer from limited coverage of the model state. This paper proposes ROSA-Tuning, a retrieval-and-recall mechanism for enhancing the long-context modeling ability of pretrained models. Alongside the standard attention mechanism, ROSA-Tuning runs a CPU-based ROSA (RWKV Online Suffix Automaton) retrieval module in parallel, which efficiently locates historical positions in long contexts that are relevant to the current query and injects the retrieved information into the model state in a trainable manner; subsequent weighted fusion is then handled by range-restricted attention. To enable end-to-end training, we employ a binary discretization strategy and a counterfactual gradient algorithm, and further optimize overall execution efficiency via an asynchronous CPU-GPU pipeline. Systematic evaluations on Qwen3-Base-1.7B show that ROSA-Tuning substantially restores the long-context modeling ability of windowed-attention models, achieving performance close to, and in some cases matching, global attention on benchmarks such as LongBench, while maintaining computational efficiency and GPU memory usage nearly comparable to windowed-attention methods, offering a new technical path for efficient long-context processing. Example code is available at https://github.com/zyaaa-ux/ROSA-Tuning.
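The abstract names ROSA as an "RWKV Online Suffix Automaton" retrieval module but does not spell out its construction. For readers unfamiliar with the underlying data structure, the following is a minimal sketch of the standard online suffix automaton algorithm in Python (class and method names are illustrative, not taken from the ROSA codebase): the automaton is extended one token at a time as the context grows, and can then answer substring-match queries against the entire history in time linear in the query length, which is what makes this structure a natural fit for streaming long-context retrieval on CPU.

```python
class SuffixAutomaton:
    """Online suffix automaton: recognizes every substring of the
    sequence fed to extend() so far. Illustrative sketch only."""

    def __init__(self):
        self.next = [dict()]  # per-state transition maps (token -> state)
        self.link = [-1]      # suffix links
        self.length = [0]     # length of longest string reaching each state
        self.last = 0         # state corresponding to the whole sequence

    def extend(self, c):
        """Append one token c, updating the automaton in amortized O(1)."""
        cur = len(self.next)
        self.next.append(dict())
        self.link.append(-1)
        self.length.append(self.length[self.last] + 1)
        p = self.last
        # Walk suffix links, adding transitions on c where missing.
        while p != -1 and c not in self.next[p]:
            self.next[p][c] = cur
            p = self.link[p]
        if p == -1:
            self.link[cur] = 0
        else:
            q = self.next[p][c]
            if self.length[p] + 1 == self.length[q]:
                self.link[cur] = q
            else:
                # Clone q so that state lengths stay consistent.
                clone = len(self.next)
                self.next.append(dict(self.next[q]))
                self.link.append(self.link[q])
                self.length.append(self.length[p] + 1)
                while p != -1 and self.next[p].get(c) == q:
                    self.next[p][c] = clone
                    p = self.link[p]
                self.link[q] = clone
                self.link[cur] = clone
        self.last = cur

    def contains(self, query):
        """True iff query is a substring of the sequence so far."""
        state = 0
        for c in query:
            if c not in self.next[state]:
                return False
            state = self.next[state][c]
        return True


# Usage: build incrementally over a token stream, then match suffixes.
sam = SuffixAutomaton()
for token in "abcbc":
    sam.extend(token)
print(sam.contains("bcb"), sam.contains("cbcb"))  # True False
```

How the retrieved match positions are scored and injected into the model state is the trainable part of ROSA-Tuning and is not reproduced here; this sketch covers only the retrieval substrate.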