ToolQA: A Dataset for LLM Question Answering with External Tools

Large Language Models (LLMs) have demonstrated impressive performance in various NLP tasks, but they still suffer from challenges such as hallucination and weak numerical reasoning. To overcome these challenges, external tools can be used to enhance LLMs' question-answering abilities. However, current evaluation methods do not distinguish between questions that can be answered using LLMs' internal knowledge and those that require external information through tool use. To address this issue, we introduce a new dataset called ToolQA, which is designed to faithfully evaluate LLMs' ability to use external tools for question answering. Our development of ToolQA involved a scalable, automated process for dataset curation, along with 13 specialized tools designed for interaction with external knowledge in order to answer questions. Importantly, we strive to minimize the overlap between our benchmark data and LLMs' pre-training data, enabling a more precise evaluation of LLMs' tool-use reasoning abilities. We conducted an in-depth diagnosis of existing tool-use LLMs to highlight their strengths, weaknesses, and potential improvements. Our findings set a new benchmark for evaluating LLMs and suggest new directions for future advancements. Our data and code are freely available to the broader scientific community on GitHub.
