Multi-turn conversations are a common and critical mode of language model interaction. However, current open training and evaluation data focus on single-turn settings, failing to capture the additional dimension of these longer interactions. To understand this multi-turn/single-turn gap, we first introduce TurnWiseEval, a new benchmark for multi-turn capabilities that is directly comparable to single-turn chat evaluation. Our evaluation isolates multi-turn-specific conversational ability through pairwise comparison against equivalent single-turn settings. We additionally introduce TurnWiseData, a synthetic multi-turn data pipeline that enables scalable generation of multi-turn training data. Our experiments with Olmo 3 show that training with multi-turn data is vital to achieving strong multi-turn chat performance, and that including as few as 10k multi-turn conversations during post-training can lead to a 12% improvement on TurnWiseEval.
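To make the pairwise comparison concrete, the sketch below shows one way a multi-turn conversation could be collapsed into an "equivalent" single-turn prompt so that a model's responses in the two settings can be judged against each other. This is a hypothetical illustration, not TurnWiseEval's released implementation; the function name `to_single_turn` and the concatenation strategy are assumptions for exposition.

```python
# Hypothetical sketch (not the paper's code): build a single-turn "equivalent"
# of a multi-turn chat so the two settings can be compared pairwise.
from typing import Dict, List


def to_single_turn(conversation: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Collapse all user turns of a multi-turn conversation into one user
    message carrying the same requests, dropping intermediate assistant turns.
    The model's answer to this prompt can then be judged pairwise against its
    answer in the original multi-turn setting."""
    user_turns = [m["content"] for m in conversation if m["role"] == "user"]
    merged = "\n".join(user_turns)  # concatenate the user's requests in order
    return [{"role": "user", "content": merged}]


# Usage: evaluate the model on both `example` (multi-turn) and
# `to_single_turn(example)` (single-turn), then compare the two outputs.
example = [
    {"role": "user", "content": "Draft a short bio for a robotics startup founder."},
    {"role": "assistant", "content": "Here is a draft: ..."},
    {"role": "user", "content": "Make it two sentences and mention her PhD."},
]
print(to_single_turn(example))
```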