Much of the previous work towards digital agents for graphical user interfaces (GUIs) has relied on text-based representations (derived from HTML or other structured data sources), which are not always readily available. These input representations have often been coupled with custom, task-specific action spaces. This paper focuses on creating agents that interact with the digital world using the same conceptual interface that humans commonly use: pixel-based screenshots as input and a generic action space corresponding to keyboard and mouse actions. Building upon recent progress in pixel-based pretraining, we show, for the first time, that it is possible for such agents to outperform human crowdworkers on the MiniWob++ benchmark of GUI-based instruction-following tasks.