VGSG: Vision-Guided Semantic-Group Network for Text-based Person Search

Text-based Person Search (TBPS) aims to retrieve images of a target pedestrian indicated by a textual description. It is essential for TBPS to extract fine-grained local features and align them across modalities. Existing methods rely on external tools or heavy cross-modal interaction to achieve explicit alignment of cross-modal fine-grained features, which is inefficient and time-consuming. In this work, we propose a Vision-Guided Semantic-Group Network (VGSG) for text-based person search to extract well-aligned fine-grained visual and textual features. In the proposed VGSG, we develop a Semantic-Group Textual Learning (SGTL) module and a Vision-guided Knowledge Transfer (VGKT) module to extract textual local features under the guidance of visual local clues. In SGTL, to obtain the local textual representation, we group textual features along the channel dimension based on the semantic cues of the language expression, which encourages similar semantic patterns to be grouped implicitly without external tools. In VGKT, vision-guided attention is employed to extract visual-related textual features, which are inherently aligned with visual cues and are termed vision-guided textual features. Furthermore, we design a relational knowledge transfer, comprising a vision-language similarity transfer and a class probability transfer, to adaptively propagate information from the vision-guided textual features to the semantic-group textual features. With the help of relational knowledge transfer, VGKT is capable of aligning semantic-group textual features with the corresponding visual features without external tools or complex pairwise interaction. Experimental results on two challenging benchmarks demonstrate the superiority of VGSG over state-of-the-art methods.
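The abstract gives no implementation details, so the following is only a rough sketch of how the three ideas it names might fit together: channel-wise grouping of textual features, attention with visual local features as queries, and a similarity transfer from vision-guided to semantic-group features. Everything concrete here is an assumption made for illustration: the helper names (`split_channels`, `max_pool`, `attend`, `transfer_loss`), the equal-size channel split, max pooling over tokens, scaled dot-product attention, and the use of a KL divergence as the transfer objective. None of these is confirmed as the authors' design.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def split_channels(token_feats, num_groups):
    # Assumption: groups are equal-size contiguous slices of the channel axis.
    dim = len(token_feats[0])
    if dim % num_groups:
        raise ValueError("channel dimension must be divisible by num_groups")
    size = dim // num_groups
    return [[tok[g * size:(g + 1) * size] for tok in token_feats]
            for g in range(num_groups)]

def max_pool(chunk):
    # Semantic-group feature (assumed): per-channel max over all tokens.
    return [max(col) for col in zip(*chunk)]

def attend(query, chunk):
    # Vision-guided feature (assumed): a visual local feature acts as the
    # query and attends over the token slices of one channel group.
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, tok) / scale for tok in chunk])
    return [sum(w * tok[c] for w, tok in zip(weights, chunk))
            for c in range(len(query))]

def transfer_loss(teacher_feat, student_feat, visual_parts):
    # Similarity transfer (assumed form): KL divergence between the
    # teacher's and the student's similarity distributions over visual parts.
    teacher = softmax([dot(teacher_feat, v) for v in visual_parts])
    student = softmax([dot(student_feat, v) for v in visual_parts])
    return sum(t * math.log(t / s) for t, s in zip(teacher, student))

# Toy inputs: 3 tokens with 4 channels, 2 visual local features with 2 channels.
tokens = [[0.9, 0.1, 0.0, 0.2],
          [0.1, 0.8, 0.3, 0.0],
          [0.0, 0.2, 0.7, 0.9]]
visual_parts = [[1.0, 0.0], [0.0, 1.0]]

chunks = split_channels(tokens, num_groups=2)
group_feats = [max_pool(c) for c in chunks]
guided_feats = [attend(v, c) for v, c in zip(visual_parts, chunks)]
losses = [transfer_loss(g, s, visual_parts)
          for g, s in zip(guided_feats, group_feats)]
```

In this toy setup the vision-guided features play the teacher role and the semantic-group features the student role, so the visual input is needed only to compute the loss. Under that assumption the text branch could run on its own at retrieval time, which would match the abstract's claim of avoiding complex pairwise interaction.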

