Efficient long-context processing remains a crucial challenge for contemporary large language models (LLMs), especially in resource-constrained environments. Soft compression architectures promise to extend effective context length by replacing long token sequences with smaller sets of learned compressed tokens. Yet the limits of compressibility, and the point at which compression begins to erase task-relevant content, remain underexplored. In this paper, we define token overflow as a regime in which compressed representations no longer contain sufficient information to answer a given query, and we propose a methodology to characterize and detect it. In the xRAG soft-compression setting, we find that query-agnostic saturation statistics reliably separate compressed from uncompressed token representations. They are therefore a practical tool for identifying compressed tokens, but they have limited ability to detect overflow. Lightweight probing classifiers over both query and context xRAG representations detect overflow with an average AUC-ROC of 0.72 across the HotpotQA, SQuADv2, and TriviaQA datasets, demonstrating that incorporating query information improves detection performance. These results mark a step from query-independent diagnostics to query-aware detectors, enabling low-cost pre-LLM gating to mitigate compression-induced errors.
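The query-aware probing and pre-LLM gating described above can be illustrated with a minimal sketch. It makes several assumptions not stated in the abstract: the probe is a plain logistic regression over the concatenated query and context vectors, the vectors and overflow labels are synthetic stand-ins for real xRAG representations, and all names (`train_probe`, `overflow_score`, `should_fall_back`, the 0.5 threshold) are hypothetical. It is not the authors' implementation and does not reproduce the reported 0.72 figure.

```python
import math
import random


def auc_roc(scores, labels):
    """AUC-ROC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC-ROC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_probe(features, labels, epochs=50, lr=0.1):
    """Fit a logistic-regression probe with plain SGD; returns (weights, bias)."""
    w = [0.0] * len(features[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            grad = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            w = [wi - lr * grad * xi for wi, xi in zip(w, x)]
            b -= lr * grad
    return w, b


def overflow_score(w, b, query_vec, context_vec):
    """Probe output in [0, 1]; higher means overflow is predicted as more likely."""
    x = list(query_vec) + list(context_vec)  # query-aware: both inputs are used
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def should_fall_back(score, threshold=0.5):
    """Gate applied before the LLM call; the threshold is an illustrative assumption."""
    return score >= threshold


def make_toy_data(n, dim=4, seed=0):
    """Synthetic (query, context, label) rows; labels are NOT real overflow annotations."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        q = [rng.gauss(0, 1) for _ in range(dim)]
        c = [rng.gauss(0, 1) for _ in range(dim)]
        y = 1 if c[0] - q[0] + 0.3 * rng.gauss(0, 1) > 0 else 0
        rows.append((q, c, y))
    return rows


def demo(seed=0):
    """Train on half of the toy data and return held-out AUC-ROC."""
    rows = make_toy_data(400, seed=seed)
    train, test = rows[:200], rows[200:]
    w, b = train_probe([q + c for q, c, _ in train], [y for _, _, y in train])
    scores = [overflow_score(w, b, q, c) for q, c, _ in test]
    return auc_roc(scores, [y for _, _, y in test])


if __name__ == "__main__":
    print(f"held-out AUC-ROC on toy data: {demo():.2f}")
```

In a deployment along these lines, `should_fall_back` would run on the probe score before the compressed tokens reach the LLM, routing flagged queries to an uncompressed or less aggressively compressed context. The threshold would need to be tuned on held-out data for the target dataset.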