Recent progress in large language models (LLMs) has sparked interest in task automation, in which a complex task described by a user instruction is decomposed into sub-tasks and external tools are invoked to execute them. Task automation plays a central role in autonomous agents. However, no systematic, standardized benchmark exists to foster the development of LLMs for task automation. To this end, we introduce TaskBench to evaluate the capability of LLMs in task automation. Specifically, task automation can be formulated as three critical stages that together fulfill the user intent: task decomposition, tool invocation, and parameter prediction. This complexity makes data collection and evaluation more challenging than for common NLP tasks. To generate high-quality evaluation datasets, we introduce the concept of a Tool Graph to represent the decomposed tasks in user intent, and adopt a back-instruct method to simulate user instructions and annotations. Furthermore, we propose TaskEval to evaluate the capability of LLMs from different aspects, including task decomposition, tool invocation, and parameter prediction. Experimental results demonstrate that TaskBench effectively reflects the capability of LLMs in task automation. Benefiting from a mixture of automated data construction and human verification, TaskBench achieves high consistency with human evaluation and can serve as a comprehensive and faithful benchmark for LLM-based autonomous agents.
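To make the Tool Graph idea concrete, the following is a minimal sketch, assuming a directed graph whose nodes are tools and whose edges are dependencies between them. The class `ToolGraph`, the tool names, and the set-level F1 scoring are all hypothetical illustrations and are not taken from the text above; they show only one plausible way a predicted tool graph could be compared against a reference annotation.

```python
from dataclasses import dataclass, field


@dataclass
class ToolGraph:
    """Hypothetical directed graph: nodes are tool names, edges are dependencies."""
    nodes: set = field(default_factory=set)
    edges: set = field(default_factory=set)

    def add_dependency(self, source: str, target: str) -> None:
        # An edge (source, target) means target consumes the output of source.
        self.nodes.update((source, target))
        self.edges.add((source, target))


def f1(predicted: set, reference: set) -> float:
    """Set-level F1 between predicted and reference items (assumed metric)."""
    overlap = len(predicted & reference)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


# Hypothetical reference annotation for an instruction such as
# "transcribe this audio clip and translate the text".
reference = ToolGraph()
reference.add_dependency("speech_to_text", "translate_text")

# Hypothetical model prediction that invokes one unnecessary extra tool.
predicted = ToolGraph()
predicted.add_dependency("speech_to_text", "translate_text")
predicted.add_dependency("translate_text", "summarize_text")

node_f1 = f1(predicted.nodes, reference.nodes)
edge_f1 = f1(predicted.edges, reference.edges)
print(f"node F1 = {node_f1:.2f}, edge F1 = {edge_f1:.2f}")
# prints: node F1 = 0.80, edge F1 = 0.67
```

Scoring nodes and edges separately distinguishes a model that selects the right tools from one that also orders them correctly; parameter prediction would need a further comparison of each tool's arguments, which this sketch omits.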