Screen user interfaces (UIs) and infographics, sharing similar visual language and design principles, play important roles in human communication and human-machine interaction. We introduce ScreenAI, a vision-language model that specializes in UI and infographics understanding. Our model improves upon the PaLI architecture with the flexible patching strategy of pix2struct and is trained on a unique mixture of datasets. At the heart of this mixture is a novel screen annotation task in which the model has to identify the type and location of UI elements. We use these text annotations to describe screens to Large Language Models and automatically generate question-answering (QA), UI navigation, and summarization training datasets at scale. We run ablation studies to demonstrate the impact of these design choices. At only 5B parameters, ScreenAI achieves new state-of-the-art results on UI- and infographics-based tasks (Multi-page DocVQA, WebSRC, MoTIF and Widget Captioning), and new best-in-class performance on others (Chart QA, DocVQA, and InfographicVQA) compared to models of similar size. Finally, we release three new datasets: one focused on the screen annotation task and two others focused on question answering.
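As a rough illustration of what a flexible patching strategy can look like, the sketch below picks a patch grid that approximately preserves the aspect ratio of the input image while staying within a fixed patch budget. It follows the general aspect-ratio-preserving idea associated with pix2struct; the function name `flexible_patch_grid` and the default values for `patch_size` and `max_patches` are illustrative assumptions, not settings reported for ScreenAI.

```python
import math


def flexible_patch_grid(height, width, patch_size=16, max_patches=1024):
    """Return (rows, cols) for a patch grid that roughly keeps the aspect ratio.

    Hypothetical helper: the image would be resized to
    (rows * patch_size, cols * patch_size) before being cut into patches,
    so that rows * cols never exceeds max_patches.
    """
    if height <= 0 or width <= 0:
        raise ValueError("height and width must be positive")
    if patch_size <= 0 or max_patches <= 0:
        raise ValueError("patch_size and max_patches must be positive")

    # Scale factor that makes the resized image fill the patch budget.
    scale = math.sqrt(max_patches * (patch_size / height) * (patch_size / width))

    # Round down so the budget is respected, and keep at least one patch per axis.
    rows = max(min(math.floor(scale * height / patch_size), max_patches), 1)
    cols = max(min(math.floor(scale * width / patch_size), max_patches), 1)
    return rows, cols


if __name__ == "__main__":
    # A tall, phone-like screenshot gets more rows than columns.
    print(flexible_patch_grid(1024, 512, patch_size=16, max_patches=32))
```

Under these assumptions, a tall screenshot is split into more rows than columns, whereas a fixed square grid would distort it.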