Large language models (LLMs) demonstrate strong capabilities across a wide range of complex tasks and are increasingly deployed at scale, placing significant demands on inference efficiency. Prior work typically decomposes inference into prefill and decode stages, with the decode stage dominating total latency. To reduce the time and memory complexity of the decode stage, a line of work introduces sparse-attention algorithms. In this paper, we show, both empirically and theoretically, that sparse attention can paradoxically increase end-to-end complexity: information loss often induces significantly longer output sequences, a phenomenon we term ``Less is Less'' (Lil). To mitigate the Lil problem, we propose an early-stopping algorithm that detects the point during sparse decoding at which information loss begins to exceed information gain. Our early-stopping algorithm reduces token consumption by up to 90% while degrading accuracy by less than 2% across reasoning-intensive benchmarks.
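To make the early-stopping idea concrete, the following is a minimal, hypothetical sketch of a decode loop that halts once a per-step proxy for information loss persistently exceeds a proxy for information gain. The interface (`step_fn`, the gain/loss proxies, and the `patience` parameter) is an assumption for illustration, not the paper's actual method or implementation.

```python
# Hypothetical early-stopping rule for sparse decoding (illustrative only).
# `step_fn`, the gain/loss proxies, and `patience` are assumed names,
# not the paper's interface.
from typing import Callable, List, Tuple


def decode_with_early_stop(
    step_fn: Callable[[List[int]], Tuple[int, float, float]],
    prompt_ids: List[int],
    max_new_tokens: int = 4096,
    patience: int = 8,
) -> List[int]:
    """Decode loop that stops once estimated information loss exceeds
    estimated information gain for `patience` consecutive steps.

    `step_fn` takes the current token ids and returns
    (next_token_id, info_gain, info_loss), where the two scalars are
    per-step proxies (e.g., log-probability improvement vs. attention
    mass dropped by the sparse pattern); the choice of proxy is an
    assumption here.
    """
    ids = list(prompt_ids)
    bad_steps = 0
    for _ in range(max_new_tokens):
        next_id, gain, loss = step_fn(ids)
        ids.append(next_id)
        # Track the crossover point: loss persistently exceeding gain.
        bad_steps = bad_steps + 1 if loss > gain else 0
        if bad_steps >= patience:
            break  # stop before the degenerate, overly long continuation
    return ids
```

In this sketch, requiring `patience` consecutive crossover steps rather than a single one is a simple way to avoid stopping on transient noise in the per-step estimates.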