Top-Down Framework for Weakly-supervised Grounded Image Captioning

Weakly-supervised grounded image captioning (WSGIC) aims to generate a caption and ground (localize) the predicted object words in the input image without bounding box supervision. Recent two-stage solutions mostly apply a bottom-up pipeline: (1) encode the input image into multiple region features using an object detector; (2) leverage the region features for captioning and grounding. However, relying on independent proposals produced by an object detector tends to make the subsequent grounded captioner overfit to finding the correct object words, overlook the relations between objects, and select incompatible proposal regions for grounding. To address these issues, we propose a one-stage weakly-supervised grounded captioner that takes the RGB image directly as input and performs captioning and grounding top-down at the image level. Specifically, we encode the image into visual token representations and propose a Recurrent Grounding Module (RGM) in the decoder to obtain precise Visual Language Attention Maps (VLAMs), which recognize the spatial locations of the objects. In addition, we explicitly inject a relation module into our one-stage framework to encourage relation understanding through multi-label classification. The resulting relation semantics serve as contextual information that facilitates the prediction of relation and object words in the caption. We observe that the relation semantics not only help the grounded captioner generate more accurate captions but also improve grounding performance. We validate the effectiveness of our method on two challenging datasets (Flickr30k Entities captioning and MSCOCO captioning). The experimental results demonstrate that our method achieves state-of-the-art grounding performance.
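
To make the grounding step concrete, the sketch below shows one plausible way an attention map over a grid of visual tokens could be turned into a bounding box for an object word. It is an illustration under stated assumptions, not the method described in the abstract: the function name `attention_map_to_box`, the relative threshold `rel_threshold`, and the 4x4 example map are all hypothetical, and thresholding at a fraction of the peak value is a common heuristic in weakly-supervised localization that is swapped in here for demonstration only.

```python
from typing import List, Tuple


def attention_map_to_box(
    attn: List[List[float]],
    image_size: Tuple[int, int],
    rel_threshold: float = 0.5,  # hypothetical choice, not taken from the abstract
) -> Tuple[int, int, int, int]:
    """Return (x_min, y_min, x_max, y_max) in pixels.

    The box tightly encloses every grid cell whose attention value is at
    least rel_threshold times the peak value of the map.
    """
    rows, cols = len(attn), len(attn[0])
    width, height = image_size

    # Keep the cells that are "hot" relative to the strongest response.
    peak = max(max(row) for row in attn)
    cutoff = rel_threshold * peak
    hot = [(r, c) for r in range(rows) for c in range(cols) if attn[r][c] >= cutoff]

    r_min = min(r for r, _ in hot)
    r_max = max(r for r, _ in hot)
    c_min = min(c for _, c in hot)
    c_max = max(c for _, c in hot)

    # Map grid indices back to pixel coordinates of the input image.
    cell_w = width / cols
    cell_h = height / rows
    return (
        round(c_min * cell_w),
        round(r_min * cell_h),
        round((c_max + 1) * cell_w),
        round((r_max + 1) * cell_h),
    )


# Hypothetical 4x4 attention map for one object word on a 224x224 image.
vlam = [
    [0.0, 0.1, 0.1, 0.0],
    [0.0, 0.1, 0.9, 0.8],
    [0.1, 0.2, 0.7, 0.6],
    [0.0, 0.0, 0.1, 0.1],
]
print(attention_map_to_box(vlam, (224, 224)))  # (112, 56, 224, 168)
```

In this toy example the four high-attention cells sit in the right-centre of the grid, so the resulting box covers the corresponding 112x112 pixel region of the image. No region proposals from an object detector are involved, which is the point of a one-stage, image-level formulation.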
