Contrastive Transformer Learning with Proximity Data Generation for Text-Based Person Search

Given a descriptive text query, text-based person search (TBPS) aims to retrieve the best-matched target person from an image gallery. This cross-modal retrieval task is challenging because of the significant modality gap, fine-grained differences, and the scarcity of annotated data. To better align the two modalities, most existing works introduce sophisticated network structures and auxiliary tasks, which are complex and hard to implement. In this paper, we propose a simple yet effective dual Transformer model for text-based person search. By exploiting a hardness-aware contrastive learning strategy, our model achieves state-of-the-art performance without any special design for local feature alignment or side information. Moreover, we propose a proximity data generation (PDG) module to automatically produce more diverse data for cross-modal training. The PDG module first introduces an automatic generation algorithm based on a text-to-image diffusion model, which generates new text-image pair samples in the proximity space of the original ones. It then combines approximate text generation and feature-level mixup during training to further increase data diversity. The PDG module largely guarantees that the generated samples are reasonable, so they are used directly for training without any human inspection for noise rejection. It improves the performance of our model significantly, providing a feasible solution to the data insufficiency problem faced by such fine-grained visual-linguistic tasks. Extensive experiments on two popular TBPS datasets (i.e., CUHK-PEDES and ICFG-PEDES) show that the proposed approach clearly outperforms state-of-the-art approaches, e.g., improving Top1, Top5, and Top10 on CUHK-PEDES by 3.88%, 4.02%, and 2.92%, respectively. The code will be available at https://github.com/HCPLab-SYSU/PersonSearch-CTLG
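
The abstract names a hardness-aware contrastive objective and feature-level mixup but does not give their exact form. The following is a minimal sketch under stated assumptions, not the authors' implementation: it uses the common temperature-scaled softmax (InfoNCE) loss, in which a lower temperature places more weight on hard negatives, and a plain convex combination for mixup. The function names (`info_nce`, `mixup_pair`), the temperature value, and the choice to share one mixing weight across both modalities are all hypothetical.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(dot(v, v)) or 1.0
    return [a / norm for a in v]


def info_nce(text_feats, image_feats, temperature=0.07):
    """Text-to-image InfoNCE loss; text i is assumed to match image i.

    Assumption: the temperature-scaled softmax stands in for the paper's
    hardness-aware strategy, whose exact form the abstract does not state.
    """
    texts = [l2_normalize(v) for v in text_feats]
    images = [l2_normalize(v) for v in image_feats]
    total = 0.0
    for i, t in enumerate(texts):
        # Cosine similarity of text i to every image, sharpened by temperature.
        logits = [dot(t, g) / temperature for g in images]
        # Numerically stable log-sum-exp over the batch.
        peak = max(logits)
        log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
        # Negative log-probability of the matching image.
        total += log_z - logits[i]
    return total / len(texts)


def mixup_pair(text_a, image_a, text_b, image_b, lam):
    """Blend two text-image feature pairs with one shared weight lam in [0, 1].

    Assumption: both modalities use the same lam so the mixed pair stays
    aligned; the abstract does not specify this detail.
    """
    def mix(u, v):
        return [lam * a + (1.0 - lam) * b for a, b in zip(u, v)]

    return mix(text_a, text_b), mix(image_a, image_b)
```

With toy two-dimensional features, a batch whose texts and images are correctly paired yields a much lower loss than the same batch with the images swapped, which is the behavior the contrastive objective relies on.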
