ScreenAI: A Vision-Language Model for UI and Infographics Understanding

From arXiv. Revision notes: 1) Added the dataset location for ScreenQA Short in Appendix I. 2) In Table 4, updated the evaluation numbers for the Screen Annotation and Complex Screen QA benchmarks, as the datasets were updated. 3) Updated Figure 4 to reflect the changes in evaluation numbers described in 2). 4) Minor revisions in other places.

Screen user interfaces (UIs) and infographics, sharing similar visual language and design principles, play important roles in human communication and human-machine interaction. We introduce ScreenAI, a vision-language model that specializes in UI and infographics understanding. Our model improves upon the PaLI architecture with the flexible patching strategy of pix2struct and is trained on a unique mixture of datasets. At the heart of this mixture is a novel screen annotation task in which the model has to identify the type and location of UI elements. We use these text annotations to describe screens to Large Language Models and automatically generate question-answering (QA), UI navigation, and summarization training datasets at scale. We run ablation studies to demonstrate the impact of these design choices. At only 5B parameters, ScreenAI achieves new state-of-the-art results on UI- and infographics-based tasks (Multi-page DocVQA, WebSRC, MoTIF and Widget Captioning), and new best-in-class performance on others (Chart QA, DocVQA, and InfographicVQA) compared to models of similar size. Finally, we release three new datasets: one focused on the screen annotation task and two others focused on question answering.
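
The abstract mentions the flexible patching strategy of pix2struct without describing it. As a rough illustration only, the sketch below shows one common way such a strategy can choose a patch grid: instead of resizing every image to a fixed square, it picks a number of rows and columns that roughly follows the image's aspect ratio while staying within a patch budget. The function name `flexible_patch_grid` and the default values for `patch_size` and `max_patches` are assumptions made for this example; they are not taken from the ScreenAI paper and this is not the authors' implementation.

```python
import math


def flexible_patch_grid(image_height, image_width, patch_size=16, max_patches=1024):
    """Choose a (rows, cols) patch grid that roughly preserves aspect ratio.

    Hypothetical helper: the name and the default values are assumptions
    for illustration, not values reported for ScreenAI.
    """
    # Scale factor such that (scaled_h / patch) * (scaled_w / patch) ~= max_patches.
    scale = math.sqrt(
        max_patches * (patch_size / image_height) * (patch_size / image_width)
    )
    # Round down so the grid never exceeds the budget; keep at least one patch per axis.
    rows = max(min(math.floor(scale * image_height / patch_size), max_patches), 1)
    cols = max(min(math.floor(scale * image_width / patch_size), max_patches), 1)
    return rows, cols


if __name__ == "__main__":
    # A tall phone screenshot gets more rows than columns,
    # a wide desktop screenshot gets more columns than rows.
    print(flexible_patch_grid(1600, 800, max_patches=512))
    print(flexible_patch_grid(800, 1600, max_patches=512))
```

Under this scheme a portrait mobile screenshot and a landscape desktop screenshot each receive a grid matching their shape, which is plausibly why such a strategy suits a model that must handle screens and infographics of varied aspect ratios.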
