When a user reveals task-critical information across several conversation turns, LLM accuracy drops by up to 65%, even though the full context remains available. We show that this "Lost in Conversation" degradation can be substantially mitigated by training models to maintain a compact rolling memory instead of attending to an ever-growing history. To make such training scalable, we introduce a low-cost sharding pipeline that converts single-turn QA datasets into multi-turn episodes in which information arrives in fragments, eliminating the need for hours of manual annotation. Trained only on sharded GSM8K, our memory-augmented policy significantly improves multi-turn accuracy and generalises zero-shot to harder math and out-of-domain long-context QA. Moreover, memory-trained models outperform full-history baselines even when given the full history at test time, suggesting that learning to compress induces more robust incremental reasoning than full-context exposure alone.
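
To make the two ideas above concrete, the following is a minimal sketch of sharding a single-turn question into per-turn fragments and folding them into a bounded rolling memory. It is an illustration under stated assumptions, not the pipeline or policy described in the abstract: the names `shard_question`, `run_episode`, `concat_update`, and the `max_chars` budget are hypothetical, the sentence-level split is a naive heuristic, and the memory update is a trivial stand-in for what would be a trained model call.

```python
import re
from typing import Callable, List


def shard_question(question: str) -> List[str]:
    """Split a single-turn question into sentence-level shards.

    Assumption: one sentence per turn. A real sharding pipeline may
    segment and rephrase the question quite differently.
    """
    parts = re.split(r"(?<=[.?!])\s+", question.strip())
    return [p for p in parts if p]


def run_episode(
    shards: List[str],
    update_memory: Callable[[str, str], str],
    max_chars: int = 200,
) -> str:
    """Reveal one shard per turn, keeping only a bounded rolling memory.

    The full history is never passed along; each turn sees only the
    previous memory and the new shard.
    """
    memory = ""
    for shard in shards:
        # Hypothetical budget: truncate the memory to max_chars characters.
        memory = update_memory(memory, shard)[:max_chars]
    return memory


def concat_update(memory: str, shard: str) -> str:
    """Placeholder update rule: append the new shard to the memory.

    In a trained system this step would be a model that rewrites and
    compresses the memory rather than concatenating text.
    """
    return (memory + " " + shard).strip()


if __name__ == "__main__":
    question = "Tom has 3 apples. He buys 2 more. How many apples does he have?"
    shards = shard_question(question)
    for turn, shard in enumerate(shards, start=1):
        print(f"turn {turn}: {shard}")
    print("final memory:", run_episode(shards, concat_update))
```

Swapping `concat_update` for a learned summariser is where the compression would actually happen; the loop structure stays the same, which is why the memory size stays bounded no matter how many turns the episode has.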