Vision-language pretraining models have achieved great success in supporting multimedia applications by understanding the alignment between images and text. However, existing vision-language pretraining models primarily focus on understanding a single image paired with a single piece of text, and often ignore alignment at the intra-document level, where a document consists of multiple sentences and multiple images. In this work, we propose DocumentCLIP, a salience-aware contrastive learning framework that trains vision-language pretraining models to comprehend the interaction between images and longer text within documents. Our model benefits real-world multimodal document understanding, such as news articles, magazines, and product descriptions, which contain linguistically and visually richer content. To the best of our knowledge, we are the first to explore multimodal intra-document links via contrastive learning. In addition, we collect a large Wikipedia dataset for pretraining, which covers diverse topics and structures. Experiments show that DocumentCLIP not only outperforms state-of-the-art baselines in the supervised setting, but also achieves the best zero-shot performance in the wild according to human evaluation. Our code is available at https://github.com/FuxiaoLiu/DocumentCLIP.
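The abstract does not detail the salience-aware contrastive objective itself. As general background, the family of methods it builds on (CLIP-style contrastive alignment) pairs image and text embeddings and pulls matched pairs together while pushing mismatched pairs apart. A minimal sketch of that standard symmetric contrastive loss is below; the function name, shapes, and temperature value are illustrative, not the paper's exact formulation.

```python
import numpy as np

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """CLIP-style symmetric contrastive (InfoNCE) loss.

    image_emb, text_emb: (B, D) arrays where row i of each is a matched pair.
    """
    # L2-normalize so the dot product is cosine similarity
    img = image_emb / np.linalg.norm(image_emb, axis=-1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=-1, keepdims=True)
    logits = img @ txt.T / temperature  # (B, B); diagonal = positive pairs

    def xent_diag(l):
        # Cross-entropy with targets on the diagonal (row-wise log-softmax)
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Symmetric: image->text retrieval plus text->image retrieval
    return (xent_diag(logits) + xent_diag(logits.T)) / 2
```

Intra-document variants of this idea treat each image and its relevant section of longer text as a positive pair, with other sections in the same batch serving as negatives.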