Webpages have been a rich, scalable resource for vision-language and language-only tasks. Yet only pieces of webpages are kept: image-caption pairs, long text articles, or raw HTML, never all in one place. As a result, webpage tasks have received little attention and structured image-text data has been left underused. To study multimodal webpage understanding, we introduce the Wikipedia Webpage suite (WikiWeb2M) of 2M pages. We verify its utility on three generative tasks: page description generation, section summarization, and contextual image captioning. We design a novel attention mechanism, Prefix Global, which selects the most relevant image and text content as global tokens to attend to the rest of the webpage for context. By using page structure to separate such tokens, it performs better than full attention with lower computational complexity. Experiments show that the new annotations from WikiWeb2M improve task performance compared to data from prior work. We also include ablations on sequence length, input features, and model size.
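The idea of treating a prefix of the input as global tokens can be illustrated with a small attention-mask sketch. This is only a minimal illustration under stated assumptions, not the implementation used in this work: it assumes the most relevant tokens are placed first in the sequence, and that the remaining tokens attend only within a fixed local window, a detail the abstract does not specify. The names `prefix_global_mask`, `num_global`, and `local_radius` are hypothetical.

```python
def prefix_global_mask(seq_len, num_global, local_radius):
    """Build a boolean mask where mask[i][j] is True if token i may attend to token j.

    Assumptions (not taken from the abstract):
      * the first `num_global` positions hold the global (prefix) tokens;
      * every other token attends only to neighbours within `local_radius`.
    """
    if not 0 <= num_global <= seq_len:
        raise ValueError("num_global must be between 0 and seq_len")
    if local_radius < 0:
        raise ValueError("local_radius must be non-negative")

    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            # Global tokens see everything, and everything sees global tokens.
            touches_prefix = i < num_global or j < num_global
            # All other pairs are restricted to a local window.
            within_window = abs(i - j) <= local_radius
            row.append(touches_prefix or within_window)
        mask.append(row)
    return mask


def count_attended_pairs(mask):
    """Number of (query, key) pairs the mask allows."""
    return sum(sum(row) for row in mask)


# Small demonstration: 8 tokens, 2 of them global, window radius 1.
demo_mask = prefix_global_mask(seq_len=8, num_global=2, local_radius=1)
for demo_row in demo_mask:
    print("".join("#" if allowed else "." for allowed in demo_row))
print(count_attended_pairs(demo_mask), "of", 8 * 8, "pairs attended")
```

Under these assumptions, the number of attended pairs grows roughly linearly in sequence length for a fixed prefix size and window, whereas full attention grows quadratically. That is one way a prefix-based scheme could cost less than full attention.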