Multimodal large language models (MLLMs) have enabled LLM-based agents to interact directly with application user interfaces (UIs), enhancing agents' performance in complex tasks. However, these agents often suffer from high latency and low reliability due to the extensive sequential UI interactions they require. To address this issue, we propose AXIS, a novel LLM-based agent framework that prioritizes actions through application programming interfaces (APIs) over UI actions. This framework also facilitates the creation and expansion of APIs through automated exploration of applications. Our experiments on Office Word demonstrate that AXIS reduces task completion time by 65%-70% and cognitive workload by 38%-53%, while maintaining accuracy of 97%-98% compared to humans. Our work contributes a new human-agent-computer interaction (HACI) framework and a fresh UI design principle for application providers in the era of LLMs. It also explores the possibility of turning every application into an agent, paving the way towards an agent-centric operating system (Agent OS).
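The API-first idea described above can be illustrated with a minimal sketch. This is not the AXIS implementation: the `Agent` class, the `api_skills` and `ui_plans` registries, and the skill names are all hypothetical, chosen only to show an agent preferring a single API call and falling back to a sequence of UI steps when no API is available.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Agent:
    # Hypothetical registry: skill name -> direct API call (one step).
    api_skills: Dict[str, Callable[[], str]] = field(default_factory=dict)
    # Hypothetical fallback: skill name -> ordered list of UI steps.
    ui_plans: Dict[str, List[str]] = field(default_factory=dict)
    # Record of every action taken, for inspection.
    log: List[str] = field(default_factory=list)

    def run(self, skill: str) -> str:
        """Execute a skill, preferring an API call over UI interaction."""
        if skill in self.api_skills:
            self.log.append(f"api:{skill}")
            return self.api_skills[skill]()
        steps = self.ui_plans.get(skill)
        if steps is None:
            raise KeyError(f"no API or UI plan for skill {skill!r}")
        # Fallback path: one log entry per sequential UI interaction.
        for step in steps:
            self.log.append(f"ui:{step}")
        return "done via UI"


if __name__ == "__main__":
    agent = Agent(
        api_skills={"insert_table": lambda: "table inserted"},
        ui_plans={"set_bold": ["click Home", "click Bold"]},
    )
    print(agent.run("insert_table"))
    print(agent.run("set_bold"))
    print(agent.log)
```

In this sketch the API path records one action while the UI fallback records one action per step, which mirrors the abstract's motivation that long sequences of UI interactions add latency.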