Text-based person search aims to simultaneously localize and identify a target person in uncropped scene images given a text query, and can be regarded as a unification of person detection and text-based person retrieval. In this work, we propose a large-scale benchmark dataset named PRW-TPS-CN, built on the widely used person search dataset PRW. Our dataset contains 47,102 sentences and therefore carries considerably more textual information than existing datasets. The texts precisely describe each person image from top to bottom, which is in line with the natural order of description. We also provide both Chinese and English descriptions to enable a more comprehensive evaluation. These characteristics make our dataset more broadly applicable. To alleviate the inconsistency between person detection and text-based person retrieval, we take advantage of the rich texts in the PRW-TPS-CN dataset. We propose to aggregate multiple texts into text prototypes that preserve the prominent text features of a person and thus better reflect the person's overall characteristics. These prototypes guide the generation of an image attention map that suppresses the detection misalignment responsible for degraded text-based person retrieval performance. As a result, the inconsistency between person detection and text-based person retrieval is largely alleviated. We conduct extensive experiments on the PRW-TPS-CN dataset. The experimental results demonstrate the effectiveness of the PRW-TPS-CN dataset and the state-of-the-art performance of our approach.
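The prototype idea described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how the texts are aggregated or how the attention map is computed, so everything below is an assumption made for illustration: the prototype is taken as the mean of L2-normalized text embeddings, and the attention map as a spatial softmax over cosine similarities between the prototype and each image location. The function names (`text_prototype`, `attention_map`, `attended_feature`) are hypothetical and are not taken from the paper.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors along `axis` to unit length (eps guards against division by zero)."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def text_prototype(text_embeddings):
    """Aggregate several text embeddings of one person into a single prototype.

    text_embeddings: array of shape (num_texts, dim).
    Assumption: mean pooling of unit-normalized embeddings; the paper may
    use a different aggregation.
    """
    normalized = l2_normalize(np.asarray(text_embeddings, dtype=np.float64))
    return l2_normalize(normalized.mean(axis=0))


def attention_map(image_features, prototype):
    """Compute a spatial attention map from a text prototype.

    image_features: array of shape (H, W, dim).
    Returns an (H, W) map that sums to 1.
    Assumption: softmax over cosine similarities.
    """
    feats = l2_normalize(np.asarray(image_features, dtype=np.float64))
    similarity = feats @ prototype                  # cosine similarity, shape (H, W)
    weights = np.exp(similarity - similarity.max()) # numerically stable softmax
    return weights / weights.sum()


def attended_feature(image_features, attn):
    """Pool image features with the attention map into one (dim,) vector."""
    return np.tensordot(attn, image_features, axes=([0, 1], [0, 1]))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    texts = rng.normal(size=(5, 8))      # five descriptions of one person (toy data)
    region = rng.normal(size=(4, 3, 8))  # feature map of one detected box (toy data)
    proto = text_prototype(texts)
    attn = attention_map(region, proto)
    print(attn.shape, attended_feature(region, attn).shape)
```

In this sketch, locations whose features agree with the aggregated text description receive higher weight, which is one plausible way a prototype could down-weight background or misaligned parts of a detected box.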