Text-based Person Search (TBPS) aims to retrieve images of the target pedestrian described by a textual query. TBPS depends on extracting fine-grained local features and aligning them across modalities. Existing methods rely on external tools or heavy cross-modal interaction to align cross-modal fine-grained features explicitly, which is inefficient and time-consuming. In this work, we propose a Vision-Guided Semantic-Group Network (VGSG) for text-based person search that extracts well-aligned fine-grained visual and textual features. VGSG comprises a Semantic-Group Textual Learning (SGTL) module and a Vision-guided Knowledge Transfer (VGKT) module, which together extract textual local features under the guidance of visual local clues. To obtain local textual representations, SGTL groups textual features along the channel dimension based on the semantic cues of the language expression, which encourages similar semantic patterns to be grouped implicitly without external tools. In VGKT, a vision-guided attention extracts vision-related textual features, which are inherently aligned with visual cues and termed vision-guided textual features. Furthermore, we design a relational knowledge transfer, comprising a vision-language similarity transfer and a class probability transfer, that adaptively propagates information from the vision-guided textual features to the semantic-group textual features. With relational knowledge transfer, VGKT aligns semantic-group textual features with the corresponding visual features without external tools or complex pairwise interaction. Experimental results on two challenging benchmarks demonstrate the superiority of VGSG over state-of-the-art methods.
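As a rough illustration of two ideas named in the abstract, the sketch below splits a textual feature vector into channel groups and computes a similarity-transfer loss between a vision-guided textual feature (treated as the teacher) and a semantic-group textual feature (treated as the student). The abstract does not specify these details, so everything here is an assumption made for illustration: the equal contiguous channel split, cosine similarity, the softmax temperature, the KL-divergence loss, and all function names are hypothetical and are not taken from the VGSG implementation.

```python
import math


def split_channels(feature, num_groups):
    """Split one feature vector into equal, contiguous channel groups (assumed scheme)."""
    if len(feature) % num_groups != 0:
        raise ValueError("channel count must be divisible by num_groups")
    size = len(feature) // num_groups
    return [feature[i * size:(i + 1) * size] for i in range(num_groups)]


def cosine(a, b):
    """Cosine similarity between two vectors; returns 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def similarity_transfer_loss(guided_text, group_text, visual_feats, temperature=0.1):
    """Assumed loss: match the student's similarity distribution over a set of
    visual features to the teacher's distribution over the same set."""
    teacher = softmax([cosine(guided_text, v) for v in visual_feats], temperature)
    student = softmax([cosine(group_text, v) for v in visual_feats], temperature)
    return kl_divergence(teacher, student)


if __name__ == "__main__":
    # Toy values only; they are not data from the paper.
    print(split_channels([0.1, 0.2, 0.3, 0.4, 0.5, 0.6], num_groups=3))
    visual = [[1.0, 0.0], [0.0, 1.0]]
    print(similarity_transfer_loss([1.0, 0.0], [0.0, 1.0], visual))
```

Under these assumptions the loss is zero when the student and teacher features induce the same similarity distribution over the visual features, and it grows as the two distributions diverge. In a trained model the grouping would be learned from semantic cues instead of fixed by position.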