TaskBench: Benchmarking Large Language Models for Task Automation

Recent progress in large language models (LLMs) has sparked interest in task automation, which decomposes complex tasks described by user instructions into sub-tasks and invokes external tools to execute them, and which plays a central role in autonomous agents. However, there is no systematic and standardized benchmark to foster the development of LLMs in task automation. To this end, we introduce TaskBench to evaluate the capability of LLMs in task automation. Specifically, task automation can be formulated as three critical stages: task decomposition, tool invocation, and parameter prediction to fulfill user intent. This complexity makes data collection and evaluation more challenging than for common NLP tasks. To generate high-quality evaluation datasets, we introduce the concept of a Tool Graph to represent the decomposed tasks in user intent, and adopt a back-instruct method to simulate user instructions and annotations. Furthermore, we propose TaskEval to evaluate the capability of LLMs from different aspects, including task decomposition, tool invocation, and parameter prediction. Experimental results demonstrate that TaskBench can effectively reflect the capability of LLMs in task automation. Benefiting from the mixture of automated data construction and human verification, TaskBench achieves high consistency with human evaluation and can be used as a comprehensive and faithful benchmark for LLM-based autonomous agents.
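
To make the three stages concrete, here is a minimal sketch of how a tool graph and a per-stage score might be represented. Everything in it is an assumption for illustration: the class names, the tool names (`image_captioning`, `translation`, `text_to_speech`), the `<node-0>` placeholder for an upstream output, and the choice of set-level F1 are hypothetical and are not taken from the TaskBench data format or from TaskEval's metric definitions.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    """One node of a tool graph: a tool name plus its predicted arguments."""
    tool: str
    args: dict = field(default_factory=dict)


@dataclass
class ToolGraph:
    """Nodes are tool calls; an edge (src, dst) means dst consumes src's output."""
    nodes: list
    edges: set = field(default_factory=set)


def f1(predicted: set, gold: set) -> float:
    """Set-level F1 between predicted and gold items (0.0 if nothing matches)."""
    hits = len(predicted & gold)
    if hits == 0:
        return 0.0
    precision = hits / len(predicted)
    recall = hits / len(gold)
    return 2 * precision * recall / (precision + recall)


def score(pred: ToolGraph, gold: ToolGraph) -> dict:
    """Score tool selection (nodes), dependencies (edges), and parameters."""
    def tools(g):
        return {n.tool for n in g.nodes}

    def params(g):
        # Each parameter is compared as a (tool, name, value) triple.
        return {(n.tool, k, str(v)) for n in g.nodes for k, v in n.args.items()}

    return {
        "node_f1": f1(tools(pred), tools(gold)),
        "edge_f1": f1(pred.edges, gold.edges),
        "param_f1": f1(params(pred), params(gold)),
    }


# Hypothetical gold annotation: caption an image, then translate the caption.
gold = ToolGraph(
    nodes=[
        ToolCall("image_captioning", {"image": "photo.jpg"}),
        ToolCall("translation", {"text": "<node-0>", "target": "fr"}),
    ],
    edges={("image_captioning", "translation")},
)

# Hypothetical model prediction: right first tool, wrong second tool.
pred = ToolGraph(
    nodes=[
        ToolCall("image_captioning", {"image": "photo.jpg"}),
        ToolCall("text_to_speech", {"text": "<node-0>"}),
    ],
    edges={("image_captioning", "text_to_speech")},
)

result = score(pred, gold)
print(result)
```

Scoring nodes, edges, and parameters separately shows which stage a model fails at: in this example the prediction gets half credit for tool selection but none for the dependency structure.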

