Text-to-Image (T2I) ReID has attracted considerable attention in recent years. CUHK-PEDES, RSTPReid, and ICFG-PEDES are the three available benchmarks for evaluating T2I ReID methods. RSTPReid and ICFG-PEDES comprise identities from MSMT17, but their diversity is limited by the small number of unique persons. CUHK-PEDES, on the other hand, comprises 13,003 identities but has relatively short text descriptions on average. Further, these datasets were captured in restricted environments with a limited number of cameras. To further diversify the identities and provide dense captions, we propose a novel dataset called IIITD-20K. IIITD-20K comprises 20,000 unique identities captured in the wild and provides a rich dataset for text-to-image ReID. Each image is densely captioned, with a description of at least 26 words. We further synthetically generate images and fine-grained captions using Stable Diffusion and BLIP models trained on our dataset. We perform extensive experiments using state-of-the-art text-to-image ReID models and vision-language pre-trained models and present a comprehensive analysis of the dataset. Our experiments also reveal that synthetically generated data leads to a substantial performance improvement in both same-dataset and cross-dataset settings. Our dataset is available at https://bit.ly/3pkA3Rj.