Large language models (LLMs) often solve a task when all instructions are given in a single prompt, but fail when the same information is revealed gradually across turns. When a clean FULL prompt and a RAW-SHARDED conversation contain the same complete user evidence, the model should still arrive at the same answer. We argue that a key reason for this gap is self-anchored drift: responses produced under partial information introduce unsupported assumptions, and those assumptions later distort the final answer. To reduce this effect, we propose Canonical-Context On-Policy Distillation (CCOPD). During training, the same base model is used in two roles: a frozen teacher conditioned on the clean FULL prompt and a trainable student that receives the same evidence incrementally through a multi-turn conversation; CCOPD aligns the student's behavior on its own trajectories with the teacher's canonical full-context behavior. Trained only on math problem conversations, CCOPD yields a 32\% average relative improvement in RAW-SHARDED performance over the original base model across math and five zero-shot out-of-domain task families, while largely preserving full-context performance. Further analyses suggest that CCOPD strengthens grounding in user evidence and reduces sensitivity to contamination from earlier assistant turns.
翻译:大型语言模型(LLM)在单次提示中给出全部指令时通常能解决问题,但当相同信息逐步在多轮对话中呈现时却会失败。当完整的FULL提示和原始分片对话包含相同的完整用户证据时,模型仍应得出相同答案。我们认为造成这一差距的关键原因是自我锚定偏移:在部分信息条件下生成的回答引入了无依据的假设,这些假设随后扭曲了最终答案。为减少此效应,我们提出规范上下文在策略蒸馏(CCOPD)。训练过程中,同一基础模型被用于两个角色:以完整FULL提示为条件的冻结教师模型,以及通过多轮对话逐步接收相同证据的可训练学生模型;CCOPD使学生模型在其自身轨迹上的行为与教师模型的规范全上下文行为对齐。仅在数学问题对话上训练的CCOPD,在数学任务和五个零样本跨域任务族中,使原始分片性能相比原始基础模型平均提升32%,同时基本保持全上下文性能。进一步分析表明,CCOPD增强了用户证据的锚定性,并降低了对早期助手轮次污染物的敏感性。