Large Language Models (LLMs) have achieved remarkable performance across a wide range of Natural Language Processing (NLP) tasks. However, in long-context scenarios, they face two challenges: high computational cost and information redundancy. To address these challenges, we propose GMSA, an encoder-decoder context compression framework that generates a compact sequence of soft tokens for downstream tasks. GMSA introduces Group Merging to achieve more uniform aggregation, mitigating semantic dominance during autoencoder pretraining, and Layer Semantic Alignment (LSA) to bridge the semantic gap between high-level abstract semantics and low-level input semantics. We first pretrain GMSA as an autoencoder and then fine-tune it for downstream tasks. Experiments demonstrate that GMSA improves context reconstruction compared to the existing soft prompt compression paradigm and outperforms baselines on multiple long-context question answering and summarization benchmarks across two backbone models, while maintaining low end-to-end latency.
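The abstract does not specify how Group Merging aggregates tokens; a minimal sketch, assuming it mean-pools consecutive fixed-size groups of encoder hidden states into soft tokens (the function name, group size, and pooling choice here are illustrative assumptions, not the paper's exact formulation):

```python
import numpy as np

def group_merge(hidden_states: np.ndarray, group_size: int) -> np.ndarray:
    """Average consecutive groups of encoder hidden states into soft tokens.

    hidden_states: array of shape [seq_len, dim].
    Returns an array of shape [seq_len // group_size, dim], one soft token
    per group, so every input token contributes equally (uniform aggregation).
    """
    seq_len, dim = hidden_states.shape
    # Assumption: the sequence is padded so seq_len divides evenly.
    assert seq_len % group_size == 0, "pad seq_len to a multiple of group_size"
    grouped = hidden_states.reshape(seq_len // group_size, group_size, dim)
    return grouped.mean(axis=1)

# Example: compress 8 hidden states into 2 soft tokens (4x compression).
h = np.arange(8 * 4, dtype=np.float32).reshape(8, 4)
soft_tokens = group_merge(h, group_size=4)
print(soft_tokens.shape)  # (2, 4)
```

Uniform pooling of this kind would avoid the semantic-dominance issue the abstract attributes to non-uniform aggregation, since no single token's state dominates a soft token.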