Human-centric perception tasks (e.g., pose estimation, human parsing, pedestrian detection, and person re-identification) play a key role in industrial applications of visual models. While each human-centric task focuses on its own semantic aspect, all of them share the same underlying semantic structure of the human body. However, few works have attempted to exploit this homogeneity and design a general-purpose model for human-centric tasks. In this work, we revisit a broad range of human-centric tasks and unify them in a minimalist manner. We propose UniHCP, a Unified Model for Human-Centric Perceptions, which unifies a wide range of human-centric tasks in a simplified end-to-end manner with the plain vision transformer architecture. With large-scale joint training on 33 human-centric datasets, UniHCP can outperform strong baselines on several in-domain and downstream tasks by direct evaluation. When adapted to a specific task, UniHCP achieves new state-of-the-art results on a wide range of human-centric tasks, e.g., 69.8 mIoU on CIHP for human parsing, 86.18 mA on PA-100K for attribute prediction, 90.3 mAP on Market1501 for ReID, and 85.8 JI on CrowdHuman for pedestrian detection, performing better than specialized models tailored for each task.