UFineBench: Towards Text-based Person Retrieval with Ultra-fine Granularity

Existing text-based person retrieval datasets often have relatively coarse-grained text annotations. This hinders models from comprehending the fine-grained semantics of query texts in real scenarios. To address this problem, we contribute a new benchmark named \textbf{UFineBench} for text-based person retrieval with ultra-fine granularity. First, we construct a new \textbf{dataset} named UFine6926. We collect a large number of person images and manually annotate each image with two detailed textual descriptions, averaging 80.8 words each. This average word count is three to four times that of previous datasets. In addition to standard in-domain evaluation, we also propose a special \textbf{evaluation paradigm} that is more representative of real scenarios. It contains a new evaluation set named UFine3C, which spans cross-domain, cross-textual-granularity and cross-textual-style settings, and a new evaluation metric named mean Similarity Distribution (mSD) for accurately measuring retrieval ability. Moreover, we propose CFAM, a more efficient \textbf{algorithm} designed specifically for text-based person retrieval with ultra fine-grained texts. It achieves fine-granularity mining by adopting a shared cross-modal granularity decoder and a hard negative matching mechanism. Under standard in-domain evaluation, CFAM achieves competitive performance across various datasets, especially on our ultra fine-grained UFine6926. Furthermore, by evaluating on UFine3C, we demonstrate that training on our UFine6926 significantly improves generalization to real scenarios compared with training on other coarse-grained datasets. The dataset and code will be made publicly available at \url{https://github.com/Zplusdragon/UFineBench}.
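
The abstract mentions a hard negative matching mechanism without specifying it. As a rough illustration of the general idea only, and not CFAM's actual procedure, the sketch below picks, for each text query, the most similar non-matching image from a similarity matrix. The function name `hardest_negatives`, the toy scores, and the assumption that the matching pair for row `i` sits at column `i` are all hypothetical simplifications (in person retrieval, one identity usually has several images).

```python
def hardest_negatives(sim):
    """For each row i (a text query), return the column index j != i
    (a non-matching image) with the highest similarity score.

    Assumption: sim is a square list of lists, and the matching
    image for text i is at column i.
    """
    n = len(sim)
    if n < 2 or any(len(row) != n for row in sim):
        raise ValueError("expected a square matrix with at least 2 rows")
    picks = []
    for i, row in enumerate(sim):
        # Exclude the positive pair on the diagonal, then take the
        # highest-scoring remaining image as the hard negative.
        candidates = [j for j in range(n) if j != i]
        picks.append(max(candidates, key=lambda j: row[j]))
    return picks


# Toy similarity scores (made up for illustration).
sim = [
    [0.9, 0.7, 0.1],
    [0.2, 0.8, 0.6],
    [0.5, 0.3, 0.4],
]
print(hardest_negatives(sim))  # [1, 2, 0]
```

In a training loop, pairs selected this way would typically be fed to a matching head as negative examples, since they are the non-matching pairs the model currently finds most confusable.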
