In recent years, the remarkable progress of large language models (LLMs) has sparked interest in task automation, which involves decomposing complex tasks described by user instructions into sub-tasks and invoking external tools to execute them, and which plays a central role in autonomous agents. However, there is a lack of systematic and standardized benchmarks to promote the development of LLMs in task automation. To address this, we introduce TaskBench, a comprehensive framework to evaluate the capability of LLMs in task automation. Specifically, task automation can be divided into three critical stages: task decomposition, tool selection, and parameter prediction. To tackle the complexities inherent in these stages, we introduce the concept of a Tool Graph to represent decomposed tasks and adopt a back-instruct method to generate high-quality user instructions. We further propose TaskEval, a multi-faceted evaluation methodology that assesses LLM performance across these three stages. Our approach combines automated construction with rigorous human verification, ensuring high consistency with human evaluation. Experimental results demonstrate that TaskBench effectively reflects the capabilities of various LLMs in task automation, providing insights into model performance across different task complexities and domains and probing the boundaries of what current models can achieve. TaskBench offers a scalable, adaptable, and reliable benchmark for advancing LLM-based autonomous agents.
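To make the Tool Graph representation and the staged evaluation concrete, the sketch below is a minimal illustration rather than the paper's implementation: the node/edge structure, the example tool names, and the F1-style tool-selection score are hypothetical stand-ins for how a predicted graph could be compared against a ground-truth graph.

```python
# Illustrative sketch (not the official TaskBench code): a Tool Graph whose
# nodes are tool invocations with predicted parameters and whose edges are
# dependencies between sub-tasks.
from dataclasses import dataclass, field


@dataclass
class ToolNode:
    name: str                                    # tool identifier, e.g. "image_captioning"
    params: dict = field(default_factory=dict)   # predicted arguments for the tool


@dataclass
class ToolGraph:
    nodes: list   # list[ToolNode]
    edges: list   # list[tuple[str, str]] of (source tool, target tool) dependencies


def node_f1(pred: ToolGraph, gold: ToolGraph) -> float:
    """F1 over predicted vs. gold tool names, a stand-in for tool-selection scoring."""
    pred_names = {n.name for n in pred.nodes}
    gold_names = {n.name for n in gold.nodes}
    tp = len(pred_names & gold_names)
    if tp == 0:
        return 0.0
    precision = tp / len(pred_names)
    recall = tp / len(gold_names)
    return 2 * precision * recall / (precision + recall)


# Hypothetical usage: a gold graph with two dependent tools and a prediction
# that recovers only one of them.
gold = ToolGraph(
    nodes=[ToolNode("image_captioning", {"image": "example.jpg"}),
           ToolNode("text_to_speech", {"text": "<caption>"})],
    edges=[("image_captioning", "text_to_speech")],
)
pred = ToolGraph(
    nodes=[ToolNode("image_captioning", {"image": "example.jpg"})],
    edges=[],
)
print(node_f1(pred, gold))  # 0.666...
```

Analogous scores over edges and over parameter key-value pairs would then cover the other two stages (task decomposition and parameter prediction) in the same spirit.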