In this paper, we propose a table and image generation task to verify how knowledge about entities acquired from natural language is retained in Vision & Language (V&L) models. The task consists of two parts: the first is to generate a table containing knowledge about an entity and its related image, and the second is to generate an image from an entity with a caption and a table containing related knowledge of the entity. In both parts, the model must know the entities involved in order to perform the generation properly. To perform the proposed tasks, we created the Wikipedia Table and Image Generation (WikiTIG) dataset from about 200,000 infoboxes in English Wikipedia articles. We evaluated performance on the tasks with respect to this research question using the V&L model OFA, which has achieved state-of-the-art results in multiple tasks. Experimental results show that OFA forgets part of its entity knowledge through pre-training performed as a complement to improve performance on image-related tasks.