The scalability of long-context large language models is fundamentally limited by the quadratic memory cost of exact self-attention, which often leads to out-of-memory (OOM) failures on modern hardware. Existing methods reduce this memory cost to near-linear, but they still assume that the full query, key, and value tensors fit in device memory. In this work, we remove this assumption by introducing CQS Divide, an operation derived from cyclic quorum set (CQS) theory that decomposes attention into a set of independent subsequence computations whose recomposition yields exactly the same result as full-sequence attention. Exploiting this decomposition, we introduce Stream-CQSA, a memory-adaptive scheduling framework that partitions attention into subproblems that fit within arbitrary memory budgets. This recasts attention from a logically monolithic operation into a collection of schedulable tasks, enabling flexible execution across devices without inter-device communication. Experiments demonstrate predictable memory scaling and show that exact attention over billion-token sequences can be executed on a single GPU via streaming, without changing the underlying mathematical definition of attention or introducing approximation error.
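The abstract does not spell out how CQS Divide works, so the following is only a hypothetical illustration of the general idea, not the paper's algorithm. It assumes a sequence split into 7 equal blocks, the difference set {0, 1, 3} mod 7 as the base of a cyclic quorum set, non-causal single-head attention, and the standard log-sum-exp merge for combining partial softmax results. All function names (`quorums`, `assign_pairs`, `partial`, `merge`, `quorum_attention`) are invented for this sketch. Each quorum acts as an independent task that touches only 3 of the 7 blocks, and merging the per-task partial results reproduces full attention up to floating-point rounding.

```python
import numpy as np

N_BLOCKS = 7
BASE = (0, 1, 3)  # difference set mod 7: every nonzero residue equals b - a for some a, b in BASE

def quorums():
    """Cyclic shifts of BASE; any two block indices share at least one quorum."""
    return [tuple(sorted((a + s) % N_BLOCKS for a in BASE)) for s in range(N_BLOCKS)]

def assign_pairs(qs):
    """Assign each (query block, key block) pair to the first quorum containing both."""
    owner = {}
    for t, quorum in enumerate(qs):
        for i in quorum:
            for j in quorum:
                owner.setdefault((i, j), t)
    return owner

def full_attention(q, k, v):
    """Reference: softmax(q k^T / sqrt(d)) v computed in one shot."""
    s = q @ k.T / np.sqrt(q.shape[-1])
    p = np.exp(s - s.max(axis=-1, keepdims=True))
    return (p / p.sum(axis=-1, keepdims=True)) @ v

def partial(q, k, v):
    """Unnormalized output, row-wise max, and row-wise softmax denominator for one block pair."""
    s = q @ k.T / np.sqrt(q.shape[-1])
    m = s.max(axis=-1, keepdims=True)
    p = np.exp(s - m)
    return p @ v, m, p.sum(axis=-1, keepdims=True)

def merge(a, b):
    """Combine two partial results for the same queries (log-sum-exp rescaling)."""
    (oa, ma, la), (ob, mb, lb) = a, b
    m = np.maximum(ma, mb)
    wa, wb = np.exp(ma - m), np.exp(mb - m)
    return oa * wa + ob * wb, m, la * wa + lb * wb

def quorum_attention(q, k, v):
    """Attention computed as 7 independent tasks, each reading only 3 of the 7 blocks."""
    n = q.shape[0]
    assert n % N_BLOCKS == 0, "sketch assumes the length is divisible by N_BLOCKS"
    blk = n // N_BLOCKS
    sl = lambda b: slice(b * blk, (b + 1) * blk)
    qs = quorums()
    owner = assign_pairs(qs)
    acc = {}  # query block index -> running (output, max, denominator)
    for t, quorum in enumerate(qs):
        for i in quorum:
            for j in quorum:
                if owner[(i, j)] != t:
                    continue  # this pair is handled by another task
                part = partial(q[sl(i)], k[sl(j)], v[sl(j)])
                acc[i] = part if i not in acc else merge(acc[i], part)
    return np.concatenate([acc[i][0] / acc[i][2] for i in range(N_BLOCKS)])

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((28, 8)) for _ in range(3))
print(np.abs(quorum_attention(q, k, v) - full_attention(q, k, v)).max())
```

In this toy setting the tasks are run in a loop on one machine, but each task needs only its own three blocks of Q, K, and V, which is the property that would let such tasks be streamed through a fixed memory budget or placed on separate devices. How the actual framework sizes and schedules its subproblems is not stated in the abstract.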