Large language models (LLMs) demonstrate strong capabilities across a wide range of complex tasks and are increasingly deployed at scale, placing significant demands on inference efficiency. Prior work typically decomposes inference into prefill and decode stages, with the decode stage dominating total latency. To reduce time and memory complexity in the decode stage, a line of work introduces sparse-attention algorithms. In this paper, we show, both empirically and theoretically, that sparse attention can paradoxically increase end-to-end complexity: information loss often induces significantly longer sequences, a phenomenon we term ``Less is Less'' (Lil). To mitigate the Lil problem, we propose an early-stopping algorithm that detects the threshold where information loss exceeds information gain during sparse decoding. Our early-stopping algorithm reduces token consumption by up to 90% with a marginal accuracy degradation of less than 2% across reasoning-intensive benchmarks.
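The abstract does not specify the stopping criterion, but the core idea — halting sparse decoding once cumulative information loss overtakes cumulative information gain — can be sketched as a minimal toy loop. Everything below is a hypothetical illustration: `step_fn` and its per-step `gain`/`loss` proxies are assumptions, not the paper's actual algorithm.

```python
# Illustrative early-stopping loop for sparse decoding.
# NOTE: step_fn, gain, and loss are hypothetical proxies; the paper's
# actual detection criterion is not given in this abstract.
from typing import Callable, List, Tuple

def sparse_decode_with_early_stop(
    step_fn: Callable[[List[int]], Tuple[int, float, float]],
    max_new_tokens: int = 1024,
) -> List[int]:
    """Decode until cumulative information loss exceeds cumulative
    information gain (a hypothetical threshold rule)."""
    tokens: List[int] = []
    cum_gain = 0.0
    cum_loss = 0.0
    for _ in range(max_new_tokens):
        token, gain, loss = step_fn(tokens)  # one sparse decoding step
        cum_gain += gain
        cum_loss += loss
        if cum_loss > cum_gain:  # loss has overtaken gain: stop early
            break
        tokens.append(token)
    return tokens

# Toy proxies: gain decays with sequence length, loss is constant,
# so the loop stops well before max_new_tokens.
def toy_step(tokens: List[int]) -> Tuple[int, float, float]:
    n = len(tokens)
    gain = max(0.0, 1.0 - 0.25 * n)  # diminishing information gain
    loss = 0.25                      # constant per-step information loss
    return n, gain, loss

out = sparse_decode_with_early_stop(toy_step)
```

With these toy proxies, decoding halts after 10 tokens instead of running to the 1024-token budget, mirroring the claimed reduction in token consumption.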