Large language models (LLMs) process entire input contexts indiscriminately, which is inefficient when the information required to answer a query is localized within the context. We present dynamic context cutoff, a novel method enabling LLMs to self-terminate processing upon acquiring sufficient task-relevant information. Through analysis of model internals, we discover that specific attention heads inherently encode "sufficiency signals" -- detectable through lightweight classifiers -- that predict when critical information has been processed. This reveals a new efficiency paradigm: the model's internal understanding, rather than external compression heuristics, naturally dictates its processing needs. Comprehensive experiments across six QA datasets (up to 40K tokens) with three model families (LLaMA/Qwen/Mistral, 1B-70B) demonstrate a 3.4% accuracy improvement alongside a 1.33x token reduction on average. Furthermore, our method outperforms other context efficiency methods at equivalent token reduction rates. Additionally, we observe an emergent scaling phenomenon: while smaller models require probing for sufficiency detection, larger models exhibit intrinsic self-assessment capabilities through prompting alone.
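The core mechanism described above -- a lightweight classifier reading attention-head activations to decide when enough context has been processed -- can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the chunked processing loop, the synthetic per-chunk features, the probe weights `w`, `b`, and the threshold of 0.9 are all assumptions; in the real method the features would come from specific attention heads inside the model, and the probe would be trained offline on labeled sufficiency examples.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for model internals: one feature vector per context
# chunk, representing pooled activations of the attention heads that carry
# the "sufficiency signal". Synthesized here so the sketch is runnable.
HEAD_DIM = 16
chunks = [rng.normal(size=HEAD_DIM) for _ in range(8)]
# Pretend the answer-bearing evidence arrives in chunk index 4:
# shift its features so the probe can detect it.
chunks[4] += 3.0

# Illustrative linear probe parameters (would be learned in practice).
w = np.ones(HEAD_DIM) / np.sqrt(HEAD_DIM)
b = -2.0


def sufficiency_score(feat: np.ndarray) -> float:
    """Sigmoid of a linear probe over the head-activation features."""
    return float(1.0 / (1.0 + np.exp(-(feat @ w + b))))


def dynamic_cutoff(chunks: list, threshold: float = 0.9) -> int:
    """Process chunks left to right; stop once the probe says 'enough'.

    Returns the number of chunks actually processed.
    """
    for i, feat in enumerate(chunks):
        if sufficiency_score(feat) >= threshold:
            return i + 1            # early exit: sufficient information seen
    return len(chunks)              # no early exit: read the whole context


used = dynamic_cutoff(chunks)
print(f"processed {used}/{len(chunks)} chunks")
```

Because the probe fires at the chunk carrying the shifted features, the loop skips the remaining chunks, mirroring the token reduction the abstract reports: later context is never processed once sufficiency is detected.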