Recent progress in large language models (LLMs) has sparked interest in task automation, in which a complex task described by a user instruction is decomposed into sub-tasks and external tools are invoked to execute them. Task automation plays a central role in autonomous agents. However, there is no systematic and standardized benchmark to foster the development of LLMs in task automation. To this end, we introduce TaskBench to evaluate the capability of LLMs in task automation. Specifically, task automation can be formulated as three critical stages: task decomposition, tool invocation, and parameter prediction to fulfill user intent. This complexity makes data collection and evaluation more challenging than for common NLP tasks. To generate high-quality evaluation datasets, we introduce the concept of a Tool Graph to represent the decomposed tasks in user intent, and adopt a back-instruct method to simulate user instructions and annotations. Furthermore, we propose TaskEval to evaluate the capability of LLMs from different aspects, including task decomposition, tool invocation, and parameter prediction. Experimental results demonstrate that TaskBench can effectively reflect the capability of LLMs in task automation. Benefiting from the mixture of automated data construction and human verification, TaskBench achieves high consistency with human evaluation and can be used as a comprehensive and faithful benchmark for LLM-based autonomous agents.
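As a rough illustration of the ideas named above, the following Python sketch shows one way a tool graph, a back-instruct prompt, and a tool-level score could fit together. Everything in it is an assumption made for illustration: the `ToolGraph` class, the `chain_from` walk, the prompt wording, the tool names, and the set-level F1 in `tool_f1` are hypothetical and are not taken from TaskBench or TaskEval.

```python
from dataclasses import dataclass, field


@dataclass
class ToolGraph:
    """Directed graph: nodes are tools, edges are dependencies between tools."""

    # Hypothetical layout: tool name -> tools that consume its output.
    edges: dict[str, list[str]] = field(default_factory=dict)

    def add_dependency(self, producer: str, consumer: str) -> None:
        """Record that `consumer` depends on the output of `producer`."""
        self.edges.setdefault(producer, []).append(consumer)
        self.edges.setdefault(consumer, [])

    def chain_from(self, start: str, length: int) -> list[str]:
        """Walk a chain by always following the first outgoing edge."""
        chain = [start]
        while len(chain) < length and self.edges.get(chain[-1]):
            chain.append(self.edges[chain[-1]][0])
        return chain


def back_instruct_prompt(chain: list[str]) -> str:
    """Build a prompt asking an LLM to write a request that needs this chain."""
    steps = " -> ".join(chain)
    return f"Write a user request that requires these tools in order: {steps}"


def tool_f1(predicted: set[str], reference: set[str]) -> float:
    """Set-level F1 between predicted and reference tools (illustrative metric)."""
    overlap = len(predicted & reference)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


# Hypothetical tools, chosen only to make the example concrete.
graph = ToolGraph()
graph.add_dependency("speech_to_text", "translate")
graph.add_dependency("translate", "summarize")

chain = graph.chain_from("speech_to_text", 3)
prompt = back_instruct_prompt(chain)
score = tool_f1({"speech_to_text", "translate"}, set(chain))
```

Here the sampled chain acts as the reference annotation, the prompt is what would be sent to an LLM to synthesize a matching user instruction, and `score` compares a model's predicted tool set against that reference.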