Weakly supervised grounded image captioning (WSGIC) aims to generate a caption and ground (localize) the predicted object words in the input image without using bounding box supervision. Recent two-stage solutions mostly apply a bottom-up pipeline: (1) an off-the-shelf object detector encodes the input image into multiple region features; (2) a soft-attention mechanism then operates on these features for captioning and grounding. However, object detectors are mainly designed to extract object semantics (i.e., the object category). Moreover, they break the structured image down into individual proposals. As a result, the subsequent grounded captioner often overfits to finding the correct object words, overlooks the relations between objects (e.g., what is the person doing?), and selects incompatible proposal regions for grounding. To address these difficulties, we propose a one-stage weakly supervised grounded captioner that directly takes the RGB image as input and performs captioning and grounding at the top-down image level. In addition, we explicitly inject a relation module into our one-stage framework to encourage relation understanding through multi-label classification. The relation semantics aid the prediction of relation words in the caption. We observe that the relation words not only help the grounded captioner generate a more accurate caption but also improve grounding performance. We validate the effectiveness of the proposed method on two challenging datasets (Flickr30k Entities captioning and MSCOCO captioning). The experimental results demonstrate that our method achieves state-of-the-art grounding performance.
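The two ingredients named in the abstract, multi-label classification for relation understanding and attention-based grounding at the image level, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the flat grid of attention scores, and the choice of the highest-weighted grid cell as the grounded location are all assumptions made for illustration, and the abstract does not specify the loss or the grounding rule.

```python
import math


def sigmoid(x):
    # Numerically stable logistic function for positive and negative inputs.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def multilabel_bce(logits, targets):
    """Mean binary cross-entropy over independent relation labels.

    Assumption: each relation word (e.g. "riding", "holding") is its own
    binary label, so several labels can be active for one image.
    """
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for z, y in zip(logits, targets):
        p = sigmoid(z)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(logits)


def softmax(scores):
    # Subtract the maximum before exponentiating to avoid overflow.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    norm = sum(exps)
    return [e / norm for e in exps]


def ground_word(attn_scores, grid_w):
    """Return the (row, col) of the grid cell with the highest attention.

    Assumption: attention scores for one generated word are given as a
    flat, row-major list over an image grid that is grid_w cells wide.
    """
    weights = softmax(attn_scores)
    best = max(range(len(weights)), key=weights.__getitem__)
    return divmod(best, grid_w), weights


if __name__ == "__main__":
    # Hypothetical logits for three relation labels; labels 0 and 2 are present.
    print(multilabel_bce([2.0, -1.0, 0.5], [1, 0, 1]))
    # Hypothetical attention scores over a 2x2 grid for one object word.
    cell, _ = ground_word([0.1, 2.0, 0.3, 0.4], grid_w=2)
    print(cell)
```

Treating relation prediction as multi-label rather than multi-class reflects that one image can contain several relations at once. The grid-cell argmax is only the simplest way to turn an attention map into a location.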