Building agents using large language models (LLMs) to control computers is an emerging research field, where the agent perceives computer states and performs actions to accomplish complex tasks. Previous computer agents have demonstrated the benefits of in-context learning (ICL); however, their performance is hindered by several issues. First, the limited context length of LLMs and complex computer states restrict the number of exemplars, as a single webpage can consume the entire context. Second, the exemplars in current methods, such as high-level plans and multi-choice questions, cannot represent complete trajectories, leading to suboptimal performance in tasks that require many steps or repeated actions. Third, existing computer agents rely on task-specific exemplars and overlook the similarity among tasks, resulting in poor generalization to novel tasks. To address these challenges, we introduce Synapse, featuring three key components: i) state abstraction, which filters out task-irrelevant information from raw states, allowing more exemplars within the limited context; ii) trajectory-as-exemplar prompting, which prompts the LLM with complete trajectories of the abstracted states and actions for improved multi-step decision-making; and iii) exemplar memory, which stores the embeddings of exemplars and retrieves them via similarity search for generalization to novel tasks. We evaluate Synapse on MiniWoB++, a standard task suite, and Mind2Web, a real-world website benchmark. In MiniWoB++, Synapse achieves a 99.2% average success rate (a 10% relative improvement) across 64 tasks using demonstrations from only 48 tasks. Notably, Synapse is the first ICL method to solve the book-flight task in MiniWoB++. Synapse also exhibits a 53% relative improvement in average step success rate over the previous state-of-the-art prompting scheme in Mind2Web.
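To make the three components concrete, the following is a minimal sketch of how state abstraction, exemplar memory, and trajectory-as-exemplar prompting could fit together. It is not the authors' implementation. Every name in it (`embed`, `abstract_state`, `ExemplarMemory`, `build_prompt`) is hypothetical, the tag-based filter is an assumed stand-in for the state abstraction step, and the bag-of-words embedding with cosine similarity is an assumed stand-in for whatever embedding model and similarity search the system actually uses.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words embedding (assumption: a real system would use a learned encoder)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def abstract_state(elements, keep_tags=("button", "input", "a", "select")):
    """State abstraction: drop elements assumed irrelevant (hypothetical tag-based rule)."""
    return [e for e in elements if e["tag"] in keep_tags]


class ExemplarMemory:
    """Stores (embedding, task, trajectory) triples and retrieves them by similarity."""

    def __init__(self):
        self._items = []

    def add(self, task, trajectory):
        self._items.append((embed(task), task, trajectory))

    def retrieve(self, query, k=1):
        q = embed(query)
        ranked = sorted(self._items, key=lambda it: cosine(q, it[0]), reverse=True)
        return [(task, traj) for _, task, traj in ranked[:k]]


def build_prompt(exemplars, task, state):
    """Trajectory-as-exemplar prompting: include each exemplar's full trajectory."""
    parts = []
    for ex_task, traj in exemplars:
        parts.append(f"Task: {ex_task}")
        for obs, action in traj:
            parts.append(f"Observation: {obs}")
            parts.append(f"Action: {action}")
    parts.append(f"Task: {task}")
    parts.append(f"Observation: {state}")
    parts.append("Action:")
    return "\n".join(parts)


# Illustrative data only; tasks, states, and action strings are made up.
memory = ExemplarMemory()
memory.add("click the button labeled OK", [("<button>OK</button>", "click(OK)")])
memory.add(
    "book a flight from A to B",
    [
        ("<input id=from>", "type(from, A)"),
        ("<input id=to>", "type(to, B)"),
        ("<button>Search</button>", "click(Search)"),
    ],
)

raw = [{"tag": "div", "text": "ad banner"}, {"tag": "button", "text": "Submit"}]
state = abstract_state(raw)
exemplars = memory.retrieve("book a flight from Paris to Rome", k=1)
prompt = build_prompt(exemplars, "book a flight from Paris to Rome", state)
print(prompt)
```

In this sketch the new flight-booking query retrieves the stored flight-booking exemplar rather than the unrelated click task, and the prompt carries that exemplar's whole observation-action sequence rather than a single step or a high-level plan. Filtering the state first is what leaves room for full trajectories within a limited context.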