Panoramic Narrative Grounding (PNG) is an emerging visual grounding task that aims to segment visual objects in images based on dense narrative captions. Current state-of-the-art methods first refine each phrase representation by aggregating the $k$ most similar image pixels, and then match the refined text representations against the pixels of the image feature map to generate segmentation results. However, simply aggregating sampled image features ignores contextual information, which can lead to phrase-to-pixel mismatch. In this paper, we propose a novel learning framework called Deformable Attention Refined Matching Network (DRMN), whose main idea is to bring deformable attention into the iterative process of feature learning so as to incorporate essential context information from pixels at different scales. DRMN iteratively re-encodes pixels with the deformable attention network after updating the feature representation of the top-$k$ most similar pixels. As such, DRMN yields accurate and discriminative pixel representations, purifies the top-$k$ most similar pixels, and consequently alleviates the phrase-to-pixel mismatch substantially. Experimental results show that our design significantly improves the matching between text phrases and image pixels. Concretely, DRMN achieves new state-of-the-art performance on the PNG benchmark with an average recall improvement of 3.5%. The code is available at https://github.com/JaMesLiMers/DRMN.
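To make the baseline pipeline described above concrete, the following is a minimal NumPy sketch of one round of top-$k$ refinement followed by phrase-to-pixel matching. It is an illustration under stated assumptions, not the DRMN implementation: the function names `refine_phrases` and `match`, the use of dot-product similarity, mean aggregation with a residual update, and the 0.5 sigmoid threshold are all hypothetical choices, and the deformable attention re-encoding of pixels is not shown.

```python
import numpy as np


def refine_phrases(phrases, pixels, k):
    """One round of top-k refinement (illustrative assumption, not DRMN itself).

    phrases: (P, D) phrase features; pixels: (N, D) flattened pixel features.
    Each phrase is updated with the mean of its k most similar pixels.
    """
    sim = phrases @ pixels.T                    # (P, N) dot-product similarity
    topk = np.argsort(-sim, axis=1)[:, :k]      # (P, k) indices of the most similar pixels
    aggregated = pixels[topk].mean(axis=1)      # (P, D) mean of the selected pixel features
    return phrases + aggregated                 # residual update (assumed form)


def match(phrases, pixels, height, width):
    """Match phrases to pixels and return one boolean mask per phrase."""
    logits = phrases @ pixels.T                 # (P, N) matching scores
    probs = 1.0 / (1.0 + np.exp(-logits))       # sigmoid
    return (probs > 0.5).reshape(len(phrases), height, width)


# Toy usage with random features: 3 phrases, a 4x5 feature map, 8 channels.
rng = np.random.default_rng(0)
height, width, dim = 4, 5, 8
pixels = rng.normal(size=(height * width, dim))
phrases = rng.normal(size=(3, dim))

refined = refine_phrases(phrases, pixels, k=4)
masks = match(refined, pixels, height, width)
```

In this sketch the pixel features stay fixed between rounds, which is exactly the limitation the abstract points to: the top-$k$ set is chosen from pixel features that carry no additional context. DRMN, as described, would re-encode the pixels with deformable attention between refinement rounds.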