Large Language Models (LLMs) demonstrate exceptional capability across diverse tasks. However, their deployment in long-context scenarios is hindered by two challenges: computational inefficiency and redundant information. To address these challenges, we propose RAM (Read As HuMan), a context compression framework built on an adaptive hybrid reading strategy. Inspired by human reading behavior (i.e., reading important content closely while skimming less relevant content), RAM partitions the context into segments and encodes them with the input query in parallel. High-relevance segments are retained in full (close reading), while low-relevance segments are compressed under query guidance into compact summary vectors (skimming). The explicit textual segments and implicit summary vectors are then concatenated and fed into the decoder, achieving both superior performance and interpretability in natural-language form. To refine the decision boundary between close reading and skimming, we further introduce a contrastive learning objective over positive and negative query-segment pairs. Experiments demonstrate that RAM outperforms existing baselines on multiple question answering and summarization benchmarks across two backbones, while delivering up to a 12x end-to-end speedup on long inputs (average length 16K; maximum length 32K).
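The hybrid reading strategy described above can be sketched as follows. This is a minimal illustrative toy, not the paper's implementation: it scores relevance by word overlap with the query and stands in for summary vectors with truncated text, whereas RAM uses a learned encoder and query-guided compression into actual vectors. All function names and the `seg_len`/`threshold` parameters are assumptions for illustration.

```python
# Toy sketch of adaptive hybrid reading: close-read high-relevance
# segments, skim (compress) the rest. Relevance here is a crude
# word-overlap score; RAM itself uses learned encoder representations.

def relevance(query, segment):
    """Fraction of query words appearing in the segment (toy proxy)."""
    q, s = set(query.lower().split()), set(segment.lower().split())
    return len(q & s) / max(len(q), 1)

def summarize(segment, k=3):
    """Placeholder for query-guided compression into a summary vector."""
    return " ".join(segment.split()[:k]) + " ..."

def hybrid_read(query, context, seg_len=8, threshold=0.5):
    """Partition context into segments, then route each segment."""
    words = context.split()
    segments = [" ".join(words[i:i + seg_len])
                for i in range(0, len(words), seg_len)]
    reads = []
    for seg in segments:
        if relevance(query, seg) >= threshold:
            reads.append(("close", seg))            # keep full text
        else:
            reads.append(("skim", summarize(seg)))  # compact stand-in
    return reads
```

The routed list (explicit text for "close", compact summaries for "skim") mirrors the concatenated sequence that RAM feeds to the decoder.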
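The contrastive objective over query-segment pairs can be illustrated with an InfoNCE-style loss that pulls a query toward its positive segment and pushes it from negatives. This is a hedged sketch under simplifying assumptions: embeddings are plain lists, similarity is a dot product, and the function names and `temperature` parameter are illustrative, not taken from the paper.

```python
import math

def dot(u, v):
    """Dot-product similarity between two embedding vectors."""
    return sum(a * b for a, b in zip(u, v))

def contrastive_loss(query_vec, pos_vec, neg_vecs, temperature=0.1):
    """InfoNCE-style loss: negative log-softmax score of the positive
    segment against the negative segments for a given query."""
    logits = [dot(query_vec, pos_vec) / temperature]
    logits += [dot(query_vec, n) / temperature for n in neg_vecs]
    m = max(logits)  # subtract max for numerical stability
    denom = sum(math.exp(l - m) for l in logits)
    return -(logits[0] - m - math.log(denom))
```

Minimizing this loss sharpens the relevance scores that decide whether a segment is close-read or skimmed, which is the decision boundary the objective is meant to refine.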