There is growing interest in device-control systems that can interpret human natural language instructions and execute them on a digital device by directly controlling its user interface. We present a dataset for device-control research, Android in the Wild (AITW), which is orders of magnitude larger than current datasets. The dataset contains human demonstrations of device interactions, including the screens and actions, and corresponding natural language instructions. It consists of 715k episodes spanning 30k unique instructions, four versions of Android (v10-13), and eight device types (Pixel 2 XL to Pixel 6) with varying screen resolutions. It contains multi-step tasks that require semantic understanding of language and visual context. This dataset poses a new challenge: actions available through the user interface must be inferred from their visual appearance. Moreover, instead of simple UI element-based actions, the action space consists of precise gestures (e.g., horizontal scrolls to operate carousel widgets). We organize our dataset to encourage robustness analysis of device-control systems, i.e., how well a system performs in the presence of new task descriptions, new applications, or new platform versions. We develop two agents and report performance across the dataset. The dataset is available at https://github.com/google-research/google-research/tree/master/android_in_the_wild.