Text-based person retrieval (TPR) aims to retrieve images of a person from a large gallery of candidates given a textual description. The core challenge lies in mapping visual and textual data into a unified latent space. Existing TPR methods concentrate on recognizing explicit, positive characteristics and often neglect the influence of negative descriptors, which leads to false positives: images that satisfy the positive criteria but would be excluded by the negative descriptors. To alleviate this issue, we introduce DualFocus, a unified framework that integrates positive and negative descriptors to improve how accurately vision-language foundation models interpret textual queries. DualFocus employs Dual (Positive/Negative) Attribute Prompt Learning (DAPL), which combines Dual Image-Attribute Contrastive (DIAC) Learning and Sensitive Image-Attributes Matching (SIAM) Learning. In this way, DualFocus enhances the detection of unseen attributes, thereby boosting retrieval precision. To further balance coarse- and fine-grained alignment of visual and textual embeddings, we propose the Dynamic Tokenwise Similarity (DTS) loss, which refines the representation of both matching and non-matching descriptions and thereby improves matching through a detailed and adaptable similarity assessment. By focusing on token-level comparisons, DualFocus significantly outperforms existing techniques in both precision and robustness. Experimental results highlight DualFocus's superior performance on CUHK-PEDES, ICFG-PEDES, and RSTPReid.
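The abstract contrasts coarse alignment with token-level comparison but does not define the DTS loss itself. The following is a minimal sketch of that contrast only, not the authors' formulation: it assumes a generic late-interaction score in which each text token is matched to its most similar image patch, and compares it with a pooled, global cosine score. The function names (`tokenwise_similarity`, `global_similarity`) and the toy 2-D embeddings are hypothetical, introduced purely for illustration.

```python
import math


def normalize(vec):
    """Return vec scaled to unit L2 norm (unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


def tokenwise_similarity(text_tokens, image_patches):
    """Fine-grained score (assumed form, not the paper's DTS loss):
    for each text token take its best cosine match among the image
    patches, then average over the text tokens."""
    best = [max(cosine(t, p) for p in image_patches) for t in text_tokens]
    return sum(best) / len(best)


def global_similarity(text_tokens, image_patches):
    """Coarse score: cosine between mean-pooled text and image embeddings."""
    def mean_pool(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]
    return cosine(mean_pool(text_tokens), mean_pool(image_patches))


# Toy embeddings (hypothetical): two text tokens, a few image patches.
text = [[1.0, 0.0], [0.0, 1.0]]
image_match = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]  # covers both tokens
image_other = [[1.0, 0.0], [1.0, 0.1]]              # covers only the first

score_match = tokenwise_similarity(text, image_match)
score_other = tokenwise_similarity(text, image_other)
coarse_match = global_similarity(text, image_match)
coarse_other = global_similarity(text, image_other)

print("token-wise:", score_match, score_other)
print("global:    ", coarse_match, coarse_other)
```

In this toy case both scores rank the matching image first, but the token-wise score separates the two images more widely, because the second text token finds no good patch in `image_other`. A training loss built on such a score would additionally need a contrastive or matching objective over batches, which the abstract does not specify.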