This paper investigates using a Large Language Model (LLM) to automatically perform web software tasks through click, scroll, and text-input operations. Previous approaches, such as reinforcement learning (RL) and imitation learning, are inefficient to train and task-specific. Our method uses filtered Document Object Model (DOM) elements as observations and performs tasks step by step, sequentially generating small programs based on the current observation. We use in-context learning with a single example that is either provided manually or generated automatically from a successful zero-shot trial. We evaluate the proposed method on the MiniWob++ benchmark. With only one in-context example, our WebWISE method achieves similar or better performance than other methods that require many demonstrations or trials.
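The observe, generate, execute loop described in the abstract can be illustrated with a minimal sketch. This is not the WebWISE implementation: the `Element` and `ToyPage` classes, the filtering rule, the prompt layout, and the one-action-per-line "program" format are all assumptions made for illustration, and the LLM is replaced by a scripted stand-in so the example runs without any model or browser.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Element:
    """Hypothetical, simplified stand-in for a DOM element."""
    ref: int
    tag: str
    text: str = ""
    visible: bool = True


# Assumed filtering rule: keep visible elements that are interactive or carry text.
INTERACTIVE_TAGS = {"button", "input", "a", "select", "textarea"}


def filter_dom(elements: List[Element]) -> List[Element]:
    return [e for e in elements
            if e.visible and (e.tag in INTERACTIVE_TAGS or e.text.strip())]


class ToyPage:
    """Toy page that only records the actions applied to it."""

    def __init__(self) -> None:
        self.elements = [
            Element(0, "div"),                            # empty container
            Element(1, "input"),
            Element(2, "button", "Submit"),
            Element(3, "button", "Hidden", visible=False),
        ]
        self.typed = {}
        self.clicked = []
        self.scroll_y = 0

    def click(self, ref: int) -> None:
        self.clicked.append(ref)

    def type_text(self, ref: int, text: str) -> None:
        self.typed[ref] = text

    def scroll(self, dy: int) -> None:
        self.scroll_y += dy


def build_prompt(task: str, observation: List[Element],
                 example: Optional[str], history: List[str]) -> str:
    """Assemble a prompt from an optional in-context example and the current state."""
    lines = []
    if example:
        lines.append("Example:\n" + example)
    lines.append("Task: " + task)
    lines.append("Observation:")
    lines += [f"  [{e.ref}] <{e.tag}> {e.text}" for e in observation]
    lines.append("Previous actions: " + ("; ".join(history) or "none"))
    lines.append("Next program:")
    return "\n".join(lines)


def execute(program: str, page: ToyPage) -> None:
    """Run a small program in an assumed format with one action per line."""
    for line in program.strip().splitlines():
        op, _, rest = line.strip().partition(" ")
        if op == "click":
            page.click(int(rest))
        elif op == "type":
            ref, _, text = rest.partition(" ")
            page.type_text(int(ref), text)
        elif op == "scroll":
            page.scroll(int(rest))
        else:
            raise ValueError(f"unknown action: {line!r}")


def run_episode(task: str, page: ToyPage, llm: Callable[[str], str],
                example: Optional[str] = None,
                max_steps: int = 5) -> Tuple[bool, List[str]]:
    """Observe, ask the model for the next program, execute it, and repeat."""
    history: List[str] = []
    for _ in range(max_steps):
        observation = filter_dom(page.elements)
        program = llm(build_prompt(task, observation, example, history))
        if program.strip() == "done":
            return True, history
        execute(program, page)
        history.append(program.strip())
    return False, history


def scripted_llm(responses: List[str]) -> Callable[[str], str]:
    """Scripted stand-in for an LLM: replays fixed responses, then says 'done'."""
    it = iter(responses)
    return lambda prompt: next(it, "done")


if __name__ == "__main__":
    ok, trace = run_episode("Type hello and submit", ToyPage(),
                            scripted_llm(["type 1 hello", "click 2"]))
    # A successful zero-shot trace can be reused as the in-context example.
    example = "\n".join(trace) if ok else None
    print(ok, example)
```

In this sketch, the trace of a successful run without an example is joined into a string and passed back as `example` on later runs, which mirrors the idea of generating the in-context example automatically from a successful zero-shot trial. How the actual method judges success and formats its example is not stated in the abstract.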