Computer-use agents (CUAs) powered by large language models (LLMs) have emerged as a promising approach to automating computer tasks, yet they struggle with the existing human-oriented OS interface: the graphical user interface (GUI). GUIs force LLMs to decompose high-level goals into lengthy, error-prone sequences of fine-grained actions, resulting in low success rates and an excessive number of LLM calls. We propose the Declarative Model Interface (DMI), an abstraction that transforms existing GUIs into three declarative primitives (access, state, and observation), thereby providing a novel OS interface tailored for LLM agents. Our key idea is policy-mechanism separation: LLMs focus on high-level semantic planning (policy), while DMI handles low-level navigation and interaction (mechanism). DMI requires neither modification of application source code nor reliance on application programming interfaces (APIs). We evaluate DMI on the Microsoft Office suite (Word, PowerPoint, Excel) on Windows. Integrating DMI into a leading GUI-based agent baseline improves task success rates by 67% and reduces interaction steps by 43.5%. Notably, DMI completes over 61% of successful tasks with a single LLM call.
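To make the policy-mechanism separation concrete, the following is a minimal sketch and not the system described above. It assumes a toy application modeled as a navigation graph; the names `ToyApp`, `ToyDMI`, `access`, `set_state`, `observe`, and `run_plan` are hypothetical and chosen only to mirror the three primitives. The plan stands in for what an LLM might emit in a single call, and the mechanism layer expands it into low-level navigation steps.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToyApp:
    """Hypothetical stand-in for a GUI: a navigation graph plus control values."""
    edges: dict                                   # menu/screen -> reachable children
    values: dict = field(default_factory=dict)    # control -> current value
    current: str = "root"
    clicks: int = 0                               # low-level actions performed


class ToyDMI:
    """Illustrative mechanism layer (assumed names, not the paper's API)."""

    def __init__(self, app: ToyApp):
        self.app = app

    def _path_to(self, target: str) -> list:
        # Breadth-first search over the navigation graph.
        # Simplification: every navigation restarts from "root".
        queue = deque([["root"]])
        seen = {"root"}
        while queue:
            path = queue.popleft()
            if path[-1] == target:
                return path
            for nxt in self.app.edges.get(path[-1], []):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(path + [nxt])
        raise KeyError(f"control not reachable: {target}")

    def access(self, target: str) -> None:
        # The caller names the destination; the click sequence is derived here.
        for node in self._path_to(target)[1:]:
            self.app.current = node
            self.app.clicks += 1

    def set_state(self, control: str, value: str) -> None:
        # The caller declares the desired end state of a control.
        self.access(control)
        self.app.values[control] = value

    def observe(self, control: str) -> Optional[str]:
        # Read back a control's value without navigating.
        return self.app.values.get(control)


def run_plan(dmi: ToyDMI, plan: list) -> None:
    """Execute a declarative plan: a list of (primitive, *args) tuples."""
    for op, *args in plan:
        getattr(dmi, op)(*args)


app = ToyApp(edges={"root": ["home"], "home": ["font"], "font": ["font_size"]})
dmi = ToyDMI(app)
run_plan(dmi, [("set_state", "font_size", "14")])   # policy: one declarative step
print(app.clicks, dmi.observe("font_size"))         # prints: 3 14
```

In this toy setting, one declarative step replaces three navigation actions that a GUI-driven agent would otherwise have to plan and issue one at a time.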