The reasoning capabilities of Large Language Models (LLMs) are typically developed through single-turn reinforcement learning, whereas real-world applications often involve multi-turn interactions with human feedback, creating a potential mismatch between training and deployment conditions. In this work, we study whether multi-turn training with human feedback is necessary for reasoning tasks. We compare conventional single-turn training with three multi-turn strategies and reach conclusions contrary to previous research. We find that models trained in a single-turn setting generalize effectively to both single- and multi-turn evaluations, while models trained with multi-turn strategies exhibit significant degradation in single-turn reasoning performance. These results suggest that for tasks with complete information, robust single-turn training remains more effective and reliable, since multi-turn training with basic feedback provides limited benefits and can even degrade reasoning capabilities.