Fine-grained supervision based on object annotations has been widely used for vision and language pre-training (VLP). However, in real-world application scenarios, aligned multi-modal data usually comes in an image-caption format, which provides only coarse-grained supervision. Collecting object annotations and building object annotation pre-extractors for each scenario is costly. In this paper, we propose a fine-grained self-supervision signal that requires no object annotations, approached from a replacement perspective. First, we propose a homonym sentence rewriting (HSR) algorithm to provide token-level supervision. The algorithm replaces a verb, noun, adjective, or quantifier in the caption with one of its homonyms from WordNet. Correspondingly, we propose a replacement vision-language modeling (RVLM) framework to exploit this token-level supervision. Two replaced modeling tasks, replaced language contrastive (RLC) and replaced language modeling (RLM), are proposed to learn the fine-grained alignment. Extensive experiments on several downstream tasks demonstrate the superior performance of the proposed method.
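The replacement step of HSR can be illustrated with a minimal sketch. It is not the implementation described in this paper: the small `TOY_LEXICON` dictionary is a hypothetical stand-in for the WordNet lookup, the function name `rewrite_caption` is illustrative, and the per-token 0/1 labels are an assumption about what form the token-level supervision might take.

```python
import random

# Hypothetical stand-in for a WordNet lookup: word -> candidate replacements.
# A real pipeline would query WordNet and filter by part of speech.
TOY_LEXICON = {
    "dog": ["cat", "horse"],        # noun
    "runs": ["sits", "sleeps"],     # verb
    "brown": ["white", "black"],    # adjective
    "two": ["three", "five"],       # quantifier
}


def rewrite_caption(caption, lexicon, rng):
    """Replace one replaceable word and return (tokens, labels).

    labels[i] is 1 if token i was replaced and 0 otherwise (assumed
    format for the token-level supervision signal).
    """
    tokens = caption.split()
    # Positions whose word has at least one candidate replacement.
    candidates = [i for i, tok in enumerate(tokens) if tok.lower() in lexicon]
    if not candidates:
        # Nothing to replace: caption is returned unchanged.
        return tokens, [0] * len(tokens)
    pos = rng.choice(candidates)
    replacement = rng.choice(lexicon[tokens[pos].lower()])
    rewritten = tokens[:pos] + [replacement] + tokens[pos + 1:]
    labels = [1 if i == pos else 0 for i in range(len(tokens))]
    return rewritten, labels


rng = random.Random(0)  # seeded for reproducibility
rewritten, labels = rewrite_caption("A brown dog runs on the grass", TOY_LEXICON, rng)
print(" ".join(rewritten))
print(labels)
```

In such a setup, the rewritten caption no longer matches the image at exactly one token, so a model could be trained both to tell the original caption from the rewritten one and to recover the original word at the marked position, which is the kind of signal the RLC and RLM tasks are described as using.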