In recent years, the remarkable progress of large language models (LLMs) has sparked interest in task automation, which involves decomposing complex tasks described by user instructions into sub-tasks and invoking external tools to execute them, and which plays a central role in autonomous agents. However, there is a lack of systematic and standardized benchmarks to promote the development of LLMs in task automation. To address this, we introduce TaskBench, a comprehensive framework to evaluate the capability of LLMs in task automation. Specifically, task automation can be divided into three critical stages: task decomposition, tool selection, and parameter prediction. To tackle the complexities inherent in these stages, we introduce the concept of a Tool Graph to represent decomposed tasks and adopt a back-instruct method to generate high-quality user instructions. We further propose TaskEval, a multi-faceted evaluation methodology that assesses LLM performance across these three stages. Our approach combines automated construction with rigorous human verification, ensuring high consistency with human evaluation. Experimental results demonstrate that TaskBench effectively reflects the capabilities of various LLMs in task automation, providing insights into model performance across different task complexities and domains and probing the boundaries of what current models can achieve. TaskBench offers a scalable, adaptable, and reliable benchmark for advancing LLM-based autonomous agents.
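To make the Tool Graph representation and the staged evaluation concrete, the sketch below is a minimal illustration rather than the paper's implementation: the node/edge structure, the example tool names, and the F1-style tool-selection score are hypothetical stand-ins for how a predicted graph could be compared against a ground-truth graph.

```python
# Illustrative sketch (not the official TaskBench code): a Tool Graph whose
# nodes are tool invocations with predicted parameters and whose edges are
# dependencies between sub-tasks.
from dataclasses import dataclass, field


@dataclass
class ToolNode:
    name: str                                    # tool identifier, e.g. "image_captioning"
    params: dict = field(default_factory=dict)   # predicted arguments for the tool


@dataclass
class ToolGraph:
    nodes: list   # list[ToolNode]
    edges: list   # list[tuple[str, str]] of (source tool, target tool) dependencies


def node_f1(pred: ToolGraph, gold: ToolGraph) -> float:
    """F1 over predicted vs. gold tool names, a stand-in for tool-selection scoring."""
    pred_names = {n.name for n in pred.nodes}
    gold_names = {n.name for n in gold.nodes}
    tp = len(pred_names & gold_names)
    if tp == 0:
        return 0.0
    precision = tp / len(pred_names)
    recall = tp / len(gold_names)
    return 2 * precision * recall / (precision + recall)


# Hypothetical usage: a gold graph with two dependent tools and a prediction
# that recovers only one of them.
gold = ToolGraph(
    nodes=[ToolNode("image_captioning", {"image": "example.jpg"}),
           ToolNode("text_to_speech", {"text": "<caption>"})],
    edges=[("image_captioning", "text_to_speech")],
)
pred = ToolGraph(
    nodes=[ToolNode("image_captioning", {"image": "example.jpg"})],
    edges=[],
)
print(node_f1(pred, gold))  # 0.666...
```

Analogous scores over edges and over parameter key-value pairs would then cover the other two stages (task decomposition and parameter prediction) in the same spirit.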