MINT: Evaluating LLMs in Multi-turn Interaction with Tools and Language Feedback

To solve complex tasks, large language models (LLMs) often require multiple rounds of interactions with the user, sometimes assisted by external tools. However, current evaluation protocols often emphasize benchmark performance with single-turn exchanges, neglecting the nuanced interactions among the user, LLMs, and external tools, while also underestimating the importance of natural language feedback from users. These oversights contribute to discrepancies between research benchmark evaluations and real-world use cases. We introduce MINT, a benchmark that evaluates LLMs' ability to solve tasks with multi-turn interactions by (1) using tools and (2) leveraging natural language feedback. To ensure reproducibility, we provide an evaluation framework where LLMs can access tools by executing Python code and receive users' natural language feedback simulated by GPT-4. We repurpose a diverse set of established evaluation datasets focusing on reasoning, coding, and decision-making and carefully curate them into a compact subset for efficient evaluation. Our analysis of 20 open- and closed-source LLMs offers intriguing findings. (a) LLMs generally benefit from tools and language feedback, with performance gains (absolute, same below) of 1-8% for each turn of tool use and 2-17% with natural language feedback. (b) Better single-turn performance does not guarantee better multi-turn performance. (c) Surprisingly, on the LLMs evaluated, supervised instruction-finetuning (SIFT) and reinforcement learning from human feedback (RLHF) generally hurt multi-turn capabilities. We expect MINT can help measure progress and incentivize research in improving LLMs' capabilities in multi-turn interactions, especially for open-source communities where multi-turn human evaluation can be less accessible compared to commercial LLMs with a larger user base.
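The interaction protocol described in the abstract (the model either runs Python code as a tool call or proposes an answer, then optionally receives language feedback) can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the MINT framework's actual code: `run_python`, `interact`, `scripted_agent`, the `(kind, content)` action format, and the turn budget are all hypothetical names and choices, and the feedback provider is a plain callable standing in for the GPT-4 simulated user.

```python
import contextlib
import io
from typing import Callable, List, Optional, Tuple

# Hypothetical types: an agent maps the history to ("code" | "answer", content);
# a feedback provider maps the history to a natural language comment.
Agent = Callable[[List[str]], Tuple[str, str]]
Feedback = Callable[[List[str]], str]


def run_python(code: str) -> str:
    """Execute a code string and return captured stdout, or the error text."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, {})  # no sandboxing here; a real harness would isolate this
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"
    return buffer.getvalue().strip()


def interact(
    agent: Agent,
    feedback: Optional[Feedback],
    answer: str,
    max_turns: int = 5,
) -> Tuple[bool, int]:
    """Run up to max_turns turns; return (solved, turns_used)."""
    history: List[str] = []
    for turn in range(1, max_turns + 1):
        kind, content = agent(history)
        if kind == "answer":
            if content == answer:
                return True, turn
            observation = "Your answer is incorrect."
        else:
            observation = run_python(content)  # tool use via code execution
        history.append(f"{kind}: {content}")
        history.append(f"observation: {observation}")
        if feedback is not None:
            # Stand-in for the simulated user's language feedback.
            history.append(f"feedback: {feedback(history)}")
    return False, max_turns


def scripted_agent(history: List[str]) -> Tuple[str, str]:
    """Hypothetical stand-in for an LLM: run code first, then submit its output."""
    if not history:
        return "code", "print(sum(range(1, 11)))"
    prefix = "observation: "
    last_observation = [h for h in history if h.startswith(prefix)][-1]
    return "answer", last_observation[len(prefix):]


if __name__ == "__main__":
    print(interact(scripted_agent, None, "55"))
```

Passing `feedback=None` corresponds to a tool-only setting, while supplying a callable adds a feedback message after every turn; comparing success rates across turn budgets in the two settings is the kind of measurement the abstract reports.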

