Computer end users have spent billions of hours completing daily tasks such as tabular data processing and project timeline scheduling. Most of these tasks are repetitive and error-prone, yet most end users lack the skills to automate them. With the advent of large language models (LLMs), directing software with natural language user requests has become a reachable goal. In this work, we propose SheetCopilot, an agent that takes a natural language task and controls a spreadsheet to fulfill its requirements. We propose a set of atomic actions as an abstraction of spreadsheet software functionalities. We further design a state machine-based task planning framework that lets LLMs interact robustly with spreadsheets. We curate a representative dataset containing 221 spreadsheet control tasks and establish a fully automated evaluation pipeline for rigorously benchmarking the ability of LLMs in software control tasks. Our SheetCopilot correctly completes 44.3\% of tasks in a single generation, outperforming the strong code generation baseline by a wide margin. Our project page: https://sheetcopilot.github.io/.