Object proposal generation is a standard pre-processing step in Vision-Language (VL) tasks such as image captioning and visual question answering. Object proposals generated for VL tasks are currently evaluated against all available annotations, a protocol that we show is misaligned: higher scores do not necessarily correspond to improved performance on downstream VL tasks. Our work studies this phenomenon and explores the effectiveness of semantic grounding in mitigating its effects. To this end, we propose evaluating object proposals against only a subset of the available annotations, selected by thresholding an annotation importance score. The importance of an object annotation to VL tasks is quantified by extracting relevant semantic information from text describing the image. We show that our method is consistent and, compared with existing techniques, aligns far better with the annotations selected by image captioning metrics and by human annotators. Lastly, as a use case, we compare current detectors used in the Scene Graph Generation (SGG) benchmark, which illustrates a setting where traditional object proposal evaluation techniques are misaligned.
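The evaluation protocol described above can be illustrated with a minimal sketch. It is not the authors' implementation: the importance score here is a toy stand-in (the fraction of captions that mention an annotation's label), recall at an IoU threshold is an assumed metric, and all names (`Annotation`, `importance`, `select_annotations`, `recall`) are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


@dataclass
class Annotation:
    label: str
    box: Box


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def importance(label: str, captions: List[str]) -> float:
    """Toy importance score (assumption): fraction of captions naming the label."""
    if not captions:
        return 0.0
    hits = sum(label.lower() in c.lower().split() for c in captions)
    return hits / len(captions)


def select_annotations(annotations: List[Annotation], captions: List[str],
                       threshold: float) -> List[Annotation]:
    """Keep only annotations whose importance score reaches the threshold."""
    return [a for a in annotations if importance(a.label, captions) >= threshold]


def recall(proposals: List[Box], annotations: List[Annotation],
           iou_threshold: float = 0.5) -> float:
    """Fraction of annotations matched by at least one proposal."""
    if not annotations:
        return 0.0
    covered = sum(
        any(iou(p, a.box) >= iou_threshold for p in proposals) for a in annotations
    )
    return covered / len(annotations)


# Made-up example data, for illustration only.
captions = ["a dog chasing a ball", "the dog runs on grass"]
annotations = [
    Annotation("dog", (10, 10, 50, 50)),
    Annotation("ball", (60, 40, 70, 50)),
    Annotation("fence", (0, 0, 100, 20)),
]
proposals = [(12, 11, 49, 52), (0, 0, 100, 20)]

important = select_annotations(annotations, captions, threshold=0.5)
recall_all = recall(proposals, annotations)      # scored against every annotation
recall_important = recall(proposals, important)  # scored against the selected subset
```

In this made-up example the proposals cover the dog and the fence but miss the ball. Recall over all annotations is therefore 2/3, while recall over the caption-relevant subset (dog and ball) is 1/2, which shows how the two protocols can rank the same proposals differently.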