Unlike Object Detection, Visual Grounding detects a single bounding box for each text-image pair. This one box per pair provides only a sparse supervision signal. Although previous works achieve impressive results, their passive utilization of the annotation, i.e., using the box annotation solely as regression ground truth, results in suboptimal performance. In this paper, we present SegVG, a novel method that transfers the box-level annotation into segmentation signals to provide additional pixel-level supervision for Visual Grounding. Specifically, we propose a Multi-layer Multi-task Encoder-Decoder as the target grounding stage, where we learn a regression query and multiple segmentation queries to ground the target by regression and segmentation of the box in each decoding layer, respectively. This approach allows us to iteratively exploit the annotation as a signal for both box-level regression and pixel-level segmentation. Moreover, since the backbones are typically initialized with pretrained parameters learned from unimodal tasks and the queries for both regression and segmentation are static learnable embeddings, a domain discrepancy remains among these three types of features, which impairs subsequent target grounding. To mitigate this discrepancy, we introduce the Triple Alignment module, in which the query, text, and vision tokens are triangularly updated to share the same space via a triple attention mechanism. Extensive experiments on five widely used datasets validate our state-of-the-art (SOTA) performance.
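The triangular update in the Triple Alignment module can be illustrated with a minimal sketch. This is a hypothetical, dependency-free rendering, not the paper's actual implementation: we assume each of the three token streams (query, text, vision) attends over the concatenation of all three via plain scaled dot-product attention, and that a few such rounds nudge the streams toward a shared space. All function and variable names here (`attend`, `triple_align`, `steps`) are illustrative.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys, values):
    # Scaled dot-product attention: each query vector is replaced by a
    # softmax-weighted combination of the value vectors.
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, values))
                    for i in range(d)])
    return out

def triple_align(query_tokens, text_tokens, vision_tokens, steps=2):
    # Hypothetical triangular update: every stream attends over the pooled
    # tokens of all three streams, so repeated rounds pull the query, text,
    # and vision features toward one shared representation space.
    q, t, v = query_tokens, text_tokens, vision_tokens
    for _ in range(steps):
        pool = q + t + v
        q = attend(q, pool, pool)
        t = attend(t, pool, pool)
        v = attend(v, pool, pool)
    return q, t, v
```

In a real model each stream would carry its own learned projections and residual connections; the sketch only conveys the structure of updating all three token types against each other rather than aligning a single pair.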