Unified Multi-Dataset Training for TBPS

Text-Based Person Search (TBPS) has seen significant progress with vision-language models (VLMs), yet it remains constrained by limited training data and the fact that VLMs are not inherently pre-trained for pedestrian-centric recognition. Existing TBPS methods therefore rely on dataset-centric fine-tuning to handle distribution shift, resulting in multiple independently trained models for different datasets. While synthetic data can increase the scale needed to fine-tune VLMs, it does not eliminate dataset-specific adaptation. This motivates a fundamental question: can we train a single unified TBPS model across multiple datasets? We show that naive joint training over all datasets remains sub-optimal because current training paradigms do not scale to a large number of unique person identities and are vulnerable to noisy image-text pairs. To address these challenges, we propose Scale-TBPS with two contributions: (i) a noise-aware unified dataset curation strategy that cohesively merges diverse TBPS datasets; and (ii) a scalable discriminative identity learning framework that remains effective under a large number of unique identities. Extensive experiments on CUHK-PEDES, ICFG-PEDES, RSTPReid, IIITD-20K, and UFine6926 demonstrate that a single Scale-TBPS model outperforms dataset-centric optimized models and naive joint training.
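To make the naive joint-training baseline concrete: each TBPS dataset numbers its person identities independently, so pooling datasets first requires mapping every (dataset, local ID) pair to a unique global label. The sketch below illustrates only that bookkeeping step. It is not the Scale-TBPS curation strategy, whose details the abstract does not give, and the `Sample` record and `merge_with_global_ids` helper are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Sample:
    """One image-text pair (hypothetical record layout, not from the paper)."""
    dataset: str     # source dataset name, e.g. "CUHK-PEDES"
    person_id: int   # identity label local to that dataset
    image_path: str
    caption: str


def merge_with_global_ids(
    datasets: Dict[str, List[Sample]],
) -> Tuple[List[Tuple[int, Sample]], int]:
    """Pool several datasets and assign each (dataset, local id) a global id.

    Returns the pooled list of (global_id, sample) pairs and the total
    number of unique identities across all datasets.
    """
    global_ids: Dict[Tuple[str, int], int] = {}
    merged: List[Tuple[int, Sample]] = []
    # Iterate in sorted name order so the mapping is reproducible.
    for name in sorted(datasets):
        for sample in datasets[name]:
            key = (name, sample.person_id)
            if key not in global_ids:
                # Next unused integer becomes this identity's global label.
                global_ids[key] = len(global_ids)
            merged.append((global_ids[key], sample))
    return merged, len(global_ids)
```

The returned identity count is the size of the label space that an identity-classification objective would face after pooling, which is the quantity the abstract says current training paradigms do not scale to.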

