Human-centric perceptions (e.g., pose estimation, human parsing, pedestrian detection, and person re-identification) play a key role in industrial applications of visual models. While each human-centric task focuses on its own semantic aspects, all of them share the same underlying semantic structure of the human body. However, few works have attempted to exploit this homogeneity to design a general-purpose model for human-centric tasks. In this work, we revisit a broad range of human-centric tasks and unify them in a minimalist manner. We propose UniHCP, a Unified Model for Human-Centric Perceptions, which unifies these tasks in a simplified end-to-end manner using a plain vision transformer architecture. With large-scale joint training on 33 human-centric datasets, UniHCP outperforms strong baselines on several in-domain and downstream tasks under direct evaluation. When adapted to a specific task, UniHCP achieves new state-of-the-art (SOTA) results on a wide range of human-centric tasks, e.g., 69.8 mIoU on CIHP for human parsing, 86.18 mA on PA-100K for attribute prediction, 90.3 mAP on Market1501 for ReID, and 85.8 JI on CrowdHuman for pedestrian detection, performing better than specialized models tailored to each task.