To solve complex tasks, large language models (LLMs) often require multiple rounds of interaction with the user, sometimes assisted by external tools. However, current evaluation paradigms often focus solely on benchmark performance with single-turn exchanges, neglecting the intricate interactions among the user, LLMs, and external tools, which creates a discrepancy between benchmark evaluation and real-world use cases. We introduce MINT, a benchmark that evaluates LLMs' ability to solve tasks with multi-turn interactions by (1) using tools and (2) leveraging natural language feedback. To ensure reproducibility, we provide an evaluation framework where LLMs can access tools by executing Python code and receive natural language feedback from a user simulated with GPT-4. We repurpose a diverse set of established datasets and tasks focusing on reasoning, coding, and decision-making and carefully curate them into a compact subset of instances for efficient evaluation. Our analysis of 20 open- and closed-source LLMs offers intriguing findings. (1) LLMs generally benefit from tool interactions and language feedback, with performance gains (absolute, here and below) of 1-8% per additional turn with tool use and 2-17% with natural language feedback. (2) Better single-turn performance does not guarantee better multi-turn performance. (3) Surprisingly, for the LLMs we evaluated, supervised instruction-finetuning (SIFT) and reinforcement learning from human feedback (RLHF) generally hurt multi-turn capabilities. We hope MINT can help measure progress and incentivize research in improving LLMs' capabilities in multi-turn interactions, especially for open-source communities, where multi-turn human evaluation has been less accessible than for commercial LLMs with larger user bases.
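The interaction pattern described in the abstract (the model acts by emitting Python code, observes the execution result, and receives natural language feedback before its next turn) can be illustrated with a minimal sketch. This is not the MINT implementation: the names `run_python`, `interact`, `toy_model`, the `ANSWER:` prefix convention, and the turn budget are all hypothetical, and the feedback function is a fixed stub standing in for the GPT-4 simulated user.

```python
import contextlib
import io
from typing import Callable, List, Optional


def run_python(code: str) -> str:
    """Hypothetical tool call: execute code, return stdout or the error text."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, {})  # illustration only; a real harness would sandbox this
    except Exception as exc:
        return f"Error: {exc!r}"
    return buffer.getvalue().strip()


def interact(
    model: Callable[[List[str]], str],
    feedback: Callable[[List[str]], str],
    is_correct: Callable[[str], bool],
    task: str,
    max_turns: int = 5,
) -> Optional[int]:
    """Return the turn at which the task was solved, or None if the budget ran out."""
    history = [task]
    for turn in range(1, max_turns + 1):
        action = model(history)
        if action.startswith("ANSWER:"):  # assumed convention for a final answer
            if is_correct(action[len("ANSWER:"):].strip()):
                return turn
            observation = "The proposed answer is incorrect."
        else:
            observation = run_python(action)  # treat anything else as tool code
        history += [action, observation]
        history.append(feedback(history))  # stand-in for the simulated user
    return None


def toy_model(history: List[str]) -> str:
    """Scripted stand-in for an LLM: use the tool once, then answer with its output."""
    if len(history) == 1:
        return "print(6 * 7)"
    return "ANSWER: " + history[2]  # history[2] holds the first tool observation


turns = interact(
    model=toy_model,
    feedback=lambda history: "Consider proposing a final answer.",
    is_correct=lambda answer: answer == "42",
    task="What is 6 times 7?",
)
print(turns)
```

Recording the turn at which each task is solved is what would let a harness of this shape report success rate as a function of the turn budget, which is the kind of per-turn gain the abstract summarizes.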