Weakly-supervised visual-textual grounding aims to learn region-phrase correspondences for the entity mentions in a sentence using only image-sentence pairs. Compared to the supervised setting, learning is more difficult because the ground-truth correspondences between bounding boxes and textual phrases are unavailable. In light of this, we propose the Semantic Prior Refinement Model (SPRM), whose predictions are obtained by combining the outputs of two main modules. The first module, which is untrained, returns a rough alignment between textual phrases and bounding boxes. The second module, which is trained, is composed of two sub-components that refine the rough alignment to improve the accuracy of the final phrase-bounding box alignments. The model is trained to maximize the multimodal similarity between an image and a sentence while minimizing the multimodal similarity between the same sentence and a new, unrelated image, carefully selected to help the most during training. Our approach achieves state-of-the-art results on two popular datasets, Flickr30k Entities and ReferIt, with a particularly large gain on ReferIt: a 9.6% absolute improvement. Moreover, thanks to the untrained component, it reaches competitive performance using only a small fraction of the training examples.
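The training objective described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the SPRM formulation: the cosine similarity, the max-over-regions and mean-over-phrases aggregation, the plain difference loss, and the hypothetical names `cosine`, `image_sentence_similarity`, `contrastive_loss`, and `ground`. The negative image is taken as given; how it is selected is not modeled here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def image_sentence_similarity(regions, phrases):
    """Assumed aggregation: each phrase takes its best-matching region,
    and the image-sentence score is the mean of those best scores."""
    best = [max(cosine(p, r) for r in regions) for p in phrases]
    return sum(best) / len(best)


def contrastive_loss(pos_regions, neg_regions, phrases):
    """Assumed loss: lower when the sentence is more similar to its own image
    (pos_regions) than to the unrelated image (neg_regions)."""
    return (image_sentence_similarity(neg_regions, phrases)
            - image_sentence_similarity(pos_regions, phrases))


def ground(regions, phrases):
    """Index of the highest-scoring region for each phrase."""
    return [max(range(len(regions)), key=lambda i: cosine(p, regions[i]))
            for p in phrases]


# Toy embeddings: two phrases, a matching image with two regions,
# and an unrelated image with a single region.
phrases = [[1.0, 0.0], [0.0, 1.0]]
pos_regions = [[1.0, 0.0], [0.0, 1.0]]
neg_regions = [[1.0, 1.0]]

loss = contrastive_loss(pos_regions, neg_regions, phrases)
alignment = ground(pos_regions, phrases)
```

In this toy case the loss is negative because the sentence matches its own image better than the unrelated one, and `alignment` assigns each phrase to a distinct region. In a trained model the embeddings would come from learned encoders and the loss would be minimized by gradient descent, which this sketch does not cover.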