Text-Based Person Search (TBPS) has seen significant progress with vision-language models (VLMs), yet it remains constrained by limited training data and by the fact that VLMs are not inherently pre-trained for pedestrian-centric recognition. Existing TBPS methods therefore rely on dataset-centric fine-tuning to handle distribution shift, which results in a separate, independently trained model for each dataset. While synthetic data can provide the scale needed to fine-tune VLMs, it does not eliminate the need for dataset-specific adaptation. This motivates a fundamental question: can we train a single unified TBPS model across multiple datasets? We show that naive joint training over all datasets remains sub-optimal because current training paradigms do not scale to a large number of unique person identities and are vulnerable to noisy image-text pairs. To address these challenges, we propose Scale-TBPS with two contributions: (i) a noise-aware unified dataset curation strategy that cohesively merges diverse TBPS datasets; and (ii) a scalable discriminative identity learning framework that remains effective under a large number of unique identities. Extensive experiments on CUHK-PEDES, ICFG-PEDES, RSTPReid, IIITD-20K, and UFine6926 demonstrate that a single Scale-TBPS model outperforms both dataset-centric optimized models and naive joint training.