Visually-situated language is ubiquitous: sources range from textbooks with diagrams, to web pages with images and tables, to mobile apps with buttons and forms. Perhaps due to this diversity, previous work has typically relied on domain-specific recipes with limited sharing of the underlying data, model architectures, and objectives. We present Pix2Struct, a pretrained image-to-text model for purely visual language understanding, which can be finetuned on tasks containing visually-situated language. Pix2Struct is pretrained by learning to parse masked screenshots of web pages into simplified HTML. The web, with its richness of visual elements cleanly reflected in the HTML structure, provides a large source of pretraining data well suited to the diversity of downstream tasks. Intuitively, this objective subsumes common pretraining signals such as OCR, language modeling, and image captioning. In addition to the novel pretraining strategy, we introduce a variable-resolution input representation and a more flexible integration of language and vision inputs, where language prompts such as questions are rendered directly on top of the input image. For the first time, we show that a single pretrained model can achieve state-of-the-art results in six out of nine tasks across four domains: documents, illustrations, user interfaces, and natural images.
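The abstract names a variable-resolution input representation without describing its mechanics. One plausible reading, assumed here and not confirmed by the text above, is that the image is rescaled with its aspect ratio preserved so that as many fixed-size patches as possible fit within a patch budget. The minimal sketch below illustrates that idea only; the function name `patch_grid` and the defaults `patch_size=16` and `max_patches=2048` are hypothetical choices for illustration.

```python
import math


def patch_grid(height, width, patch_size=16, max_patches=2048):
    """Return (rows, cols) of a patch grid for an aspect-preserving rescale.

    Assumption (not stated in the abstract): the image is scaled by one
    factor in both dimensions so that rows * cols stays within max_patches.
    """
    # Scale factor at which (scale*height/patch) * (scale*width/patch)
    # equals max_patches exactly, before rounding down.
    scale = math.sqrt(max_patches * (patch_size / height) * (patch_size / width))
    # Round down so the budget is never exceeded; keep at least one patch
    # per axis for extremely elongated images.
    rows = max(min(math.floor(scale * height / patch_size), max_patches), 1)
    cols = max(min(math.floor(scale * width / patch_size), max_patches), 1)
    return rows, cols


if __name__ == "__main__":
    # A square image and a tall, narrow one get different grid shapes
    # under the same patch budget.
    for h, w in [(100, 100), (1000, 250)]:
        rows, cols = patch_grid(h, w)
        print(f"{h}x{w} -> {rows} rows x {cols} cols = {rows * cols} patches")
```

Under this assumption, a tall screenshot receives more rows than columns instead of being squashed into a fixed square, which is the property a variable-resolution representation is meant to provide.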