Person attribute recognition and attribute-based retrieval are two core human-centric tasks. Recognition requires predicting attributes from a person's appearance, while retrieval requires finding the persons who match an attribute query. The two tasks are closely related. In this study, we demonstrate that a sufficiently robust network for person attribute recognition can be adapted to improve retrieval performance. A further issue in retrieval is the modality gap between attribute queries and person images. We therefore present CLEAR, a unified network designed to address both tasks. We introduce a robust cross-transformers network for person attribute recognition. For retrieval, we leverage a pre-trained language model to construct pseudo-descriptions for attribute queries, and we introduce an effective training strategy that trains only a few additional adapter parameters. The unified CLEAR model is evaluated on five benchmarks: PETA, PA100K, Market-1501, RAPv2, and UPAR-2024. Without bells and whistles, CLEAR achieves state-of-the-art or competitive results on both tasks, and it significantly outperforms competing methods in person retrieval on the widely used Market-1501 dataset.
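
As a rough illustration of what a pseudo-description for an attribute query might look like, the sketch below turns a set of attribute-value pairs into a short sentence that a pre-trained language model could then encode. The function name `build_pseudo_description`, the sentence template, and the attribute names are assumptions made for this example; the abstract does not specify how CLEAR builds its descriptions, and the language-model encoding step is not shown.

```python
from typing import Dict, List


def build_pseudo_description(attributes: Dict[str, str]) -> str:
    """Turn an attribute query into a short natural-language sentence.

    Hypothetical template for illustration only; not the CLEAR implementation.
    """
    if not attributes:
        return "A person."
    # Sort by attribute name so the same query always yields the same sentence.
    parts: List[str] = [
        f"{name} is {value}" for name, value in sorted(attributes.items())
    ]
    return "A person whose " + ", ".join(parts) + "."


# Example attribute query (attribute names are made up for this sketch).
query = {"gender": "female", "hair": "long", "upper clothing": "red"}
description = build_pseudo_description(query)
print(description)
```

Expressing the query as text places it in the input space of a language model, which is one plausible way to narrow the modality gap between attribute queries and person images.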