Webpages have been a rich, scalable resource for vision-language and language-only tasks. Yet existing datasets keep only pieces of webpages: image-caption pairs, long text articles, or raw HTML, never all in one place. As a result, webpage tasks have received little attention, and structured image-text data has been left underused. To study multimodal webpage understanding, we introduce the Wikipedia Webpage suite (WikiWeb2M), containing 2M pages with all of the associated image, text, and structure data. We verify its utility on three generative tasks: page description generation, section summarization, and contextual image captioning. We design a novel attention mechanism, Prefix Global, which selects the most relevant image and text content as global tokens that attend to the rest of the webpage for context. By using page structure to separate such tokens, it performs better than full attention with lower computational complexity. Extensive experiments show that the new data in WikiWeb2M improves task performance compared to prior work.
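To make the complexity claim for Prefix Global concrete, the following is a minimal sketch of one way such an attention pattern could be expressed as a mask. It is not the authors' implementation. It assumes that the most relevant image and text tokens have already been placed at the start of the sequence, that the remaining tokens use a sliding local window, and that the names `prefix_global_mask`, `num_global`, and `radius` are hypothetical, chosen only for illustration.

```python
def prefix_global_mask(seq_len, num_global, radius):
    """Build a seq_len x seq_len boolean attention mask.

    mask[i][j] is True when token i may attend to token j.
    Assumptions (not taken from the source):
      - the first `num_global` tokens are the global prefix;
      - all other tokens attend locally within `radius` positions.
    """
    if not 0 <= num_global <= seq_len:
        raise ValueError("num_global must be between 0 and seq_len")
    if radius < 0:
        raise ValueError("radius must be non-negative")

    mask = [[False] * seq_len for _ in range(seq_len)]
    for i in range(seq_len):
        for j in range(seq_len):
            if i < num_global or j < num_global:
                # Global tokens attend to everything, and every
                # token attends to the global tokens.
                mask[i][j] = True
            elif abs(i - j) <= radius:
                # Non-global tokens see only their local neighbourhood.
                mask[i][j] = True
    return mask


def count_attended_pairs(mask):
    """Count (query, key) pairs the mask allows, a rough proxy for cost."""
    return sum(sum(row) for row in mask)


if __name__ == "__main__":
    mask = prefix_global_mask(seq_len=8, num_global=2, radius=1)
    allowed = count_attended_pairs(mask)
    full = 8 * 8  # full attention allows every pair
    print(f"allowed pairs: {allowed} of {full}")
```

Under these assumptions the number of allowed pairs grows roughly linearly in the sequence length for a fixed prefix size and radius, whereas full attention grows quadratically. How the global tokens are actually selected from page structure is not covered by this sketch.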