We introduce ScreenQA, a novel benchmarking dataset designed to advance screen content understanding through question answering. Existing screen datasets focus either on low-level structural and component understanding or on much higher-level composite tasks, such as navigation and task completion for autonomous agents. ScreenQA attempts to bridge this gap. By annotating 86k question-answer pairs over the RICO dataset, we aim to benchmark screen reading-comprehension capability, thereby laying the foundation for vision-based automation over screenshots. Our annotations encompass full answers, short answer phrases, and corresponding UI contents with bounding boxes, enabling four subtasks that address various application scenarios. We evaluate the dataset's efficacy using both open-weight and proprietary models in zero-shot, fine-tuned, and transfer learning settings. We further demonstrate positive transfer to web applications, highlighting the dataset's potential beyond mobile applications.
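To make the annotation format concrete, below is a minimal sketch of what a single ScreenQA record could look like, assuming a simple Python representation; the field names, types, and coordinate convention are illustrative assumptions, not the dataset's published schema.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical sketch of one ScreenQA annotation record.
# Field names and coordinate convention are assumptions for illustration.

@dataclass
class BoundingBox:
    # Pixel coordinates of a UI element on the screenshot.
    left: int
    top: int
    right: int
    bottom: int

@dataclass
class UIAnswer:
    # Text content of an answering UI element plus its location.
    text: str
    bbox: BoundingBox

@dataclass
class ScreenQAExample:
    screenshot_id: str    # identifier of the RICO screenshot
    question: str         # natural-language question about the screen
    full_answer: str      # complete-sentence answer
    short_answer: str     # short answer phrase
    ui_contents: List[UIAnswer] = field(default_factory=list)  # grounded UI elements

# Example instance (values invented for illustration):
example = ScreenQAExample(
    screenshot_id="12345",
    question="What is the departure time of the first flight?",
    full_answer="The first flight departs at 7:45 AM.",
    short_answer="7:45 AM",
    ui_contents=[UIAnswer(text="7:45 AM", bbox=BoundingBox(120, 340, 210, 372))],
)
```

A record of this shape would naturally support the four subtasks mentioned above: full-answer generation, short-answer extraction, UI-content selection, and bounding-box grounding.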