Large language models (LLMs) suffer significant performance degradation when user instructions and context are distributed over multiple conversational turns, yet multi-turn (MT) interactions dominate chat interfaces. The routine approach of appending the full chat history to each prompt rapidly exhausts context windows, leading to increased latency, higher computational costs, and diminishing returns as conversations extend. We introduce MT-OSC, a One-off Sequential Condensation framework that efficiently and automatically condenses chat history in the background without disrupting the user experience. MT-OSC employs a Condenser Agent that combines a few-shot inference-based Condenser with a lightweight Decider to selectively retain essential information, reducing token counts by up to 72% in 10-turn dialogues. Evaluated across 13 state-of-the-art LLMs and diverse multi-turn benchmarks, MT-OSC consistently narrows the multi-turn performance gap, yielding improved or preserved accuracy across datasets while remaining robust to distractors and irrelevant turns. Our results establish MT-OSC as a scalable solution for multi-turn chats, enabling richer context within constrained input spaces and reducing latency and operational cost while preserving accuracy.
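As a rough illustration of the Condenser and Decider split described in the abstract, the following minimal sketch condenses a chat history and falls back to the full history when condensation does not help. Everything here is an assumption made for illustration: the function names, the whitespace token count, the filler-dropping `toy_condenser` (a stand-in for the few-shot LLM call), and the "accept only if shorter" rule in `toy_decider` are hypothetical and are not taken from MT-OSC itself.

```python
from typing import Callable, List

# Hypothetical list of low-information turns; a real system would not hard-code this.
FILLER = {"ok", "okay", "thanks", "thank you", "hi", "hello", "got it"}


def count_tokens(text: str) -> int:
    # Whitespace split as a crude stand-in for a real tokenizer (assumption).
    return len(text.split())


def toy_condenser(turns: List[str]) -> str:
    # Stand-in for a few-shot LLM condenser: drops filler turns and exact repeats.
    kept: List[str] = []
    for turn in turns:
        stripped = turn.strip()
        normalized = stripped.lower().rstrip(".!")
        if normalized in FILLER or stripped in kept:
            continue
        kept.append(stripped)
    return " ".join(kept)


def toy_decider(original: str, condensed: str) -> bool:
    # Hypothetical acceptance rule: condensed text must be non-empty and strictly shorter.
    return bool(condensed) and count_tokens(condensed) < count_tokens(original)


def condense_history(
    turns: List[str],
    condenser: Callable[[List[str]], str] = toy_condenser,
    decider: Callable[[str, str], bool] = toy_decider,
) -> str:
    # Return the condensed history if the decider accepts it, else the full history.
    original = " ".join(t.strip() for t in turns)
    condensed = condenser(turns)
    return condensed if decider(original, condensed) else original


history = [
    "Write a function that sorts a list.",
    "Thanks!",
    "It should sort in descending order.",
    "Ok.",
    "Use Python.",
]
context = condense_history(history)
print(count_tokens(" ".join(history)), "->", count_tokens(context))
```

In this toy run the two filler turns are dropped and the three instruction-bearing turns are kept. Passing the condenser and decider as callables is only a design convenience here, so that an LLM-backed condenser could be swapped in without changing the fallback logic.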