Long-context inference increasingly operates over CPU-resident KV caches, either because decoding-time KV states exceed GPU memory capacity or because disaggregated prefill-decode systems place KV data in host memory. Block-sparse attention reduces attention cost in this setting, but sparsity alone is insufficient for end-to-end efficiency. GPU-only designs remain constrained by PCIe bandwidth and metadata memory overhead, while CPU-GPU hybrid designs still suffer from substantial GPU idle time and from bottlenecks in CPU-side top-k selection and sparse attention computation. Fluxion is built on three key insights for sparse attention over CPU-resident KV caches: output-aware KV budgeting, head-specific and granularity-aware sparse configuration, and cross-device coordinated execution. Guided by these insights, Fluxion combines a lightweight head-property predictor, a granularity-budget selector, and a priority-based scheduler to jointly optimize budget allocation, sparse configuration, and CPU-GPU execution overlap. This co-design enables hybrid sparse attention to achieve both accuracy and system efficiency in long-context inference. Across 2 models, 3 benchmarks, and 40 tasks, Fluxion preserves quality well: the worst average degradation is only -0.26 relative to full attention (FULL), while it delivers a 1.5$\times$ to 3.7$\times$ speedup over the strongest fixed sparse hybrid baseline, whose KV budget is only 0.05.