In recent years, the remarkable progress of large language models (LLMs) has sparked interest in task automation: decomposing a complex task described by a user instruction into sub-tasks and invoking external tools to execute them, a capability central to autonomous agents. However, systematic and standardized benchmarks for LLM-based task automation are still lacking. To address this gap, we introduce TaskBench, a comprehensive framework for evaluating the capability of LLMs in task automation. Specifically, we divide task automation into three critical stages: task decomposition, tool selection, and parameter prediction. To capture the complexities inherent in these stages, we introduce the concept of a Tool Graph to represent decomposed tasks and adopt a back-instruct method to generate high-quality user instructions. We further propose TaskEval, a multi-faceted evaluation methodology that assesses LLM performance across all three stages. Our benchmark construction combines automated generation with rigorous human verification, yielding high consistency with human evaluation. Experimental results demonstrate that TaskBench effectively reflects the task-automation capabilities of various LLMs, offering insight into model performance across different task complexities and domains and exposing the limits of what current models can achieve. TaskBench thus provides a scalable, adaptable, and reliable benchmark for advancing LLM-based autonomous agents.
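To make the three stages concrete, the sketch below shows one way a decomposed task might be represented as a tool graph: nodes are selected tools with predicted parameters, and edges are data dependencies between sub-tasks. All names here (`ToolNode`, `ToolGraph`, the example tools) are hypothetical illustrations, not the actual TaskBench schema.

```python
from dataclasses import dataclass, field

@dataclass
class ToolNode:
    """One tool invocation: the selected tool plus its predicted arguments."""
    name: str
    arguments: dict

@dataclass
class ToolGraph:
    """A decomposed task: tool invocations linked by data dependencies."""
    nodes: dict = field(default_factory=dict)  # node id -> ToolNode
    edges: list = field(default_factory=list)  # (src, dst): dst consumes src's output

    def add_node(self, node_id: str, name: str, arguments: dict) -> None:
        self.nodes[node_id] = ToolNode(name, arguments)

    def add_edge(self, src: str, dst: str) -> None:
        self.edges.append((src, dst))

# Hypothetical instruction: "Summarize this web page, then translate
# the summary into French." Decomposed into three dependent sub-tasks.
g = ToolGraph()
g.add_node("n1", "fetch_url", {"url": "https://example.com"})
g.add_node("n2", "summarize", {"text": "<output of n1>"})
g.add_node("n3", "translate", {"text": "<output of n2>", "target_lang": "fr"})
g.add_edge("n1", "n2")
g.add_edge("n2", "n3")
```

Under this framing, task decomposition produces the set of nodes, tool selection fixes each node's `name`, and parameter prediction fills in each node's `arguments`, so the three stages can be evaluated separately against a reference graph.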