There is growing interest in device-control systems that can interpret natural language instructions from humans and execute them on a digital device by directly controlling its user interface. We present a dataset for device-control research, Android in the Wild (AITW), which is orders of magnitude larger than current datasets. The dataset contains human demonstrations of device interactions, including the screens and actions, and the corresponding natural language instructions. It consists of 715k episodes spanning 30k unique instructions, four versions of Android (v10-13), and eight device types (Pixel 2 XL to Pixel 6) with varying screen resolutions. It contains multi-step tasks that require semantic understanding of language and visual context. The dataset poses a new challenge: the actions available through the user interface must be inferred from their visual appearance. Moreover, instead of simple UI element-based actions, the action space consists of precise gestures (e.g., horizontal scrolls to operate carousel widgets). We organize the dataset to encourage robustness analysis of device-control systems, i.e., how well a system performs in the presence of new task descriptions, new applications, or new platform versions. We develop two agents and report their performance across the dataset. The dataset is available at https://github.com/google-research/google-research/tree/master/android_in_the_wild.