In Text-Based Person Search (TBPS), mainstream methods aim to explore more efficient interaction frameworks between text descriptions and visual data. However, recent approaches face two principal challenges. First, the widely used random Masked Language Modeling (MLM) treats all words in the text equally during training, so numerous semantically vacuous words ('with', 'the', etc.) are masked; these fail to contribute effective interaction in cross-modal MLM and hamper representation alignment. Second, the manual descriptions in TBPS datasets are repetitive and inevitably contain inaccuracies. To address these issues, we introduce an Attention-Guided Alignment (AGA) framework featuring two innovative components: Attention-Guided Mask (AGM) Modeling and a Text Enrichment Module (TEM). AGM dynamically masks semantically meaningful words by aggregating the attention weights derived from the text encoding process, so that the cross-modal MLM can capture information related to each masked word from the text context and the image, and align their representations. Meanwhile, TEM alleviates the low-quality representations caused by repetitive and erroneous text descriptions by replacing those semantically meaningful words with the MLM's predictions; this not only enriches the text descriptions but also prevents overfitting. Extensive experiments on three challenging benchmarks demonstrate the effectiveness of our AGA, which achieves new state-of-the-art results with Rank-1 accuracies of 78.36%, 67.31%, and 67.4% on CUHK-PEDES, ICFG-PEDES, and RSTPReid, respectively.
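The core idea of AGM can be illustrated with a minimal sketch: instead of masking tokens uniformly at random, tokens that receive higher aggregated attention from the text encoder are preferred for masking. The function name, the mask ratio, and the toy attention scores below are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of attention-guided masking: tokens with the highest
# aggregated attention weights (assumed to be averaged over encoder layers
# and heads upstream) are selected for masking, so the MLM objective focuses
# on semantically meaningful words rather than function words.

def agm_select_mask(tokens, attention_weights, mask_ratio=0.3):
    """Replace the highest-attention tokens with a [MASK] placeholder.

    tokens: list of word strings
    attention_weights: one aggregated attention score per token
    mask_ratio: fraction of tokens to mask (illustrative default)
    """
    assert len(tokens) == len(attention_weights)
    n_mask = max(1, int(len(tokens) * mask_ratio))
    # Rank token indices by descending attention weight.
    ranked = sorted(range(len(tokens)), key=lambda i: -attention_weights[i])
    mask_idx = set(ranked[:n_mask])
    return ["[MASK]" if i in mask_idx else t for i, t in enumerate(tokens)]

# Toy example: content words ("woman", "red") carry more attention than
# stopwords ("a", "with"), so they are the ones masked.
tokens = ["a", "woman", "with", "a", "red", "handbag"]
attn = [0.05, 0.30, 0.04, 0.05, 0.28, 0.28]
print(agm_select_mask(tokens, attn, mask_ratio=0.34))
# → ['a', '[MASK]', 'with', 'a', '[MASK]', 'handbag']
```

Random MLM would mask 'a' or 'with' just as often, which is exactly the inefficiency the abstract describes; weighting by attention concentrates the reconstruction signal on words the image can actually ground.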