Top-Down Viewing for Weakly Supervised Grounded Image Captioning

Weakly supervised grounded image captioning (WSGIC) aims to generate a caption and ground (localize) the predicted object words in the input image without using bounding box supervision. Recent two-stage solutions mostly apply a bottom-up pipeline: (1) an off-the-shelf object detector encodes the input image into multiple region features; (2) a soft-attention mechanism then uses these features for captioning and grounding. However, object detectors are mainly designed to extract object semantics (i.e., the object category). Moreover, they break the structured image into individual proposals. As a result, the subsequent grounded captioner often overfits to finding the correct object words while overlooking the relations between objects (e.g., what is the person doing?) and selecting incompatible proposal regions for grounding. To address these difficulties, we propose a one-stage weakly supervised grounded captioner that directly takes the RGB image as input to perform captioning and grounding at the top-down image level. In addition, we explicitly inject a relation module into our one-stage framework to encourage relation understanding through multi-label classification. The relation semantics aid the prediction of relation words in the caption. We observe that the relation words not only help the grounded captioner generate a more accurate caption but also improve grounding performance. We validate the effectiveness of our proposed method on two challenging datasets (Flickr30k Entities captioning and MSCOCO captioning). The experimental results demonstrate that our method achieves state-of-the-art grounding performance.
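
To make the multi-label classification idea concrete, the following is a minimal sketch of how relation supervision could be derived from captions alone, with no bounding boxes involved. It is an illustration under stated assumptions, not the paper's implementation: the relation vocabulary, the function names, and the example logits are all hypothetical, and the loss shown is the standard binary cross-entropy over independent per-label sigmoids, which is one common choice for multi-label classification.

```python
import math

# Hypothetical relation vocabulary (assumption); the abstract does not
# specify the actual label set used by the relation module.
RELATION_VOCAB = ["riding", "holding", "sitting", "standing", "on", "in"]


def relation_targets(caption_tokens, vocab=RELATION_VOCAB):
    """Build a multi-hot target: 1.0 for each vocabulary word in the caption."""
    present = set(caption_tokens)
    return [1.0 if word in present else 0.0 for word in vocab]


def sigmoid(x):
    """Logistic function mapping a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def multilabel_bce(logits, targets, eps=1e-12):
    """Mean binary cross-entropy, treating each label independently."""
    total = 0.0
    for z, t in zip(logits, targets):
        p = sigmoid(z)
        total -= t * math.log(p + eps) + (1.0 - t) * math.log(1.0 - p + eps)
    return total / len(logits)


# The caption is the only supervision: relation labels come from its words.
caption = "a man riding a horse on a beach".split()
targets = relation_targets(caption)

# Made-up scores standing in for the output of a relation module (assumption).
logits = [2.0, -1.5, -2.0, -1.0, 1.0, -0.5]
loss = multilabel_bce(logits, targets)
```

Because several relation words can appear in one caption, each label gets its own sigmoid rather than competing in a single softmax. This is what distinguishes the multi-label setting from ordinary single-label classification.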
