Long-context inference with Large Language Models (LLMs) is costly due to quadratic attention and growing key-value caches, motivating context compression. In this work, we study soft context compression, where a long context is condensed into a small set of continuous representations. Existing methods typically re-purpose the LLM itself as a trainable compressor, relying on layer-by-layer self-attention to iteratively aggregate information. We argue that this paradigm suffers from two structural limitations: (i) progressive representation overwriting across layers, and (ii) uncoordinated allocation of compression capacity across tokens. We propose ComprExIT (Context Compression via Explicit Information Transmission), a lightweight framework that recasts soft compression as a new paradigm: explicit information transmission over frozen LLM hidden states. This decouples compression from the model's internal self-attention dynamics. ComprExIT performs (i) depth-wise transmission, which selectively transmits multi-layer information into token anchors, mitigating progressive overwriting, and (ii) width-wise transmission, which aggregates anchors into a small number of slots via a globally optimized transmission plan, ensuring coordinated allocation of information. Across six question-answering benchmarks, ComprExIT consistently outperforms state-of-the-art context compression methods while introducing only ~1% additional parameters, demonstrating that explicit and coordinated information transmission enables more effective and robust long-context compression.
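The two transmission steps above can be illustrated with a toy numerical sketch. Everything below is a hypothetical stand-in, not the paper's implementation: the layer weights would be learned rather than random, and we use a few Sinkhorn-style normalization steps as one plausible instance of a "globally optimized transmission plan". The dimensions (`L`, `N`, `K`, `d`) and names (`anchors`, `slots`) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (hypothetical): layers, context tokens, slots, hidden dim.
L, N, K, d = 4, 8, 2, 16

# Frozen multi-layer hidden states: one d-dim vector per (layer, token).
H = rng.normal(size=(L, N, d))

# --- Depth-wise transmission (sketch) ---
# Rather than letting later layers overwrite earlier representations,
# explicitly combine all layers per token with a (here random, in
# practice learned) softmax weighting over the depth axis.
layer_logits = rng.normal(size=(L, N))
layer_w = np.exp(layer_logits) / np.exp(layer_logits).sum(axis=0, keepdims=True)
anchors = np.einsum("ln,lnd->nd", layer_w, H)  # (N, d) token anchors

# --- Width-wise transmission (sketch) ---
# Aggregate N anchors into K slots with a globally normalized plan P,
# so allocation is coordinated across tokens instead of decided
# independently per token.
slot_queries = rng.normal(size=(K, d))         # hypothetical learned queries
scores = anchors @ slot_queries.T              # (N, K) anchor-slot affinities
P = np.exp(scores - scores.max())
for _ in range(10):                            # alternate row/column balancing
    P = P / P.sum(axis=1, keepdims=True)       # each anchor sends total mass 1
    P = P * (N / K) / P.sum(axis=0, keepdims=True)  # each slot receives mass N/K

# Compressed context: K continuous slot vectors (mass-weighted averages).
slots = (P.T @ anchors) / P.sum(axis=0)[:, None]   # (K, d)
print(anchors.shape, slots.shape)
```

The balancing loop enforces the "coordinated allocation" property: no slot can be starved or monopolized, because column sums are constrained to N/K while each anchor still distributes unit mass.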