TaskBench: Benchmarking Large Language Models for Task Automation

Recent progress in large language models (LLMs) has sparked interest in task automation, which decomposes complex tasks described by user instructions into sub-tasks and invokes external tools to execute them. Task automation plays a central role in autonomous agents. However, there is no systematic and standardized benchmark to foster the development of LLMs in task automation. To this end, we introduce TaskBench to evaluate the capability of LLMs in task automation. Specifically, task automation can be formulated as three critical stages that together fulfill user intent: task decomposition, tool invocation, and parameter prediction. This complexity makes data collection and evaluation more challenging than in common NLP tasks. To generate high-quality evaluation datasets, we introduce the concept of a Tool Graph to represent the decomposed tasks in user intent, and adopt a back-instruct method to simulate user instructions and annotations. Furthermore, we propose TaskEval to evaluate the capability of LLMs from different aspects, including task decomposition, tool invocation, and parameter prediction. Experimental results demonstrate that TaskBench effectively reflects the capability of LLMs in task automation. Benefiting from a mixture of automated data construction and human verification, TaskBench achieves high consistency with human evaluation and can be used as a comprehensive and faithful benchmark for LLM-based autonomous agents.
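The abstract names a Tool Graph and a tool-invocation evaluation but gives no implementation details. The following is only an illustrative sketch under stated assumptions: a tool graph is modeled as tool names (nodes) plus dependency pairs (edges), and a predicted graph is scored against a reference graph with a set-based F1. The `ToolGraph` class, the `set_f1` helper, the tool names, and the choice of F1 are all hypothetical and are not taken from TaskBench or TaskEval.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ToolGraph:
    """Hypothetical tool graph: nodes are tool names, edges are dependencies."""

    nodes: set[str] = field(default_factory=set)
    edges: set[tuple[str, str]] = field(default_factory=set)

    def add_dependency(self, src: str, dst: str) -> None:
        # The output of `src` feeds `dst`; both tools become nodes.
        self.nodes.update((src, dst))
        self.edges.add((src, dst))


def set_f1(predicted: set, gold: set) -> float:
    """Set-based F1 (an assumed metric, chosen only for illustration)."""
    if not predicted and not gold:
        return 1.0
    overlap = len(predicted & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


# Reference decomposition for an invented instruction such as
# "transcribe this audio, translate it, then summarize it".
gold = ToolGraph()
gold.add_dependency("speech_to_text", "translate")
gold.add_dependency("translate", "summarize")

# A model prediction that picks the wrong final tool.
pred = ToolGraph()
pred.add_dependency("speech_to_text", "translate")
pred.add_dependency("translate", "text_to_speech")

node_f1 = set_f1(pred.nodes, gold.nodes)  # which tools were selected
edge_f1 = set_f1(pred.edges, gold.edges)  # which dependencies were recovered
print(f"node F1 = {node_f1:.3f}, edge F1 = {edge_f1:.3f}")
```

Scoring nodes and edges separately distinguishes a model that selects the right tools from one that also orders them correctly; parameter prediction would need a further comparison of the arguments passed to each tool, which this sketch leaves out.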

