Large Language Models (LLMs) achieve strong performance on many reasoning benchmarks, yet these evaluations typically focus on isolated tasks that differ from real-world usage in task-oriented dialogue (TOD). In this setting, LLMs must reason implicitly while generating text and adhering to instructions on role, format, and style. This mismatch raises concerns about whether benchmark performance accurately reflects models' reasoning robustness in TOD settings. We investigate how framing reasoning tasks within TOD affects LLM performance by introducing BOULDER, a new dynamic benchmark covering eight travel-related tasks that require arithmetic, spatial, and temporal reasoning with both commonsense and formal aspects. Each problem is presented in both an isolated and a dialogue-based variant, enabling controlled comparison while mitigating data contamination. Experiments on eight LLMs reveal a substantial and consistent performance gap between the isolated and dialogue settings. Through ablations and qualitative analysis, we show that this gap is largely driven by the multi-turn nature of dialogue, with additional effects from role conditioning and tool-use requirements. Our results highlight the need to evaluate LLM reasoning in realistic interactive scenarios.