In this work, we focus on the challenge of temporally consistent human-centric dense prediction across video sequences. Existing models achieve strong per-frame accuracy but often flicker under motion, occlusion, and lighting changes, and paired human video supervision covering multiple dense tasks is scarce. We address this gap with a scalable synthetic data pipeline that generates photorealistic human frames and motion-aligned sequences with pixel-accurate depth, normals, and masks. Unlike prior synthetic pipelines that produce only static data, ours provides both frame-level labels for spatial learning and sequence-level supervision for temporal learning. Building on this, we train a unified ViT-based dense predictor that (i) injects an explicit human geometric prior via CSE embeddings and (ii) improves geometry-feature reliability with a lightweight channel reweighting module applied after feature fusion. Our two-stage training strategy, combining static pretraining with dynamic sequence supervision, lets the model first acquire robust spatial representations and then refine temporal consistency on motion-aligned sequences. Extensive experiments show that our model achieves state-of-the-art performance on THuman2.1 and Hi4D and generalizes effectively to in-the-wild videos.
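The channel reweighting module mentioned above can be illustrated with a minimal squeeze-and-excitation-style sketch. This is a hypothetical NumPy illustration, not the paper's implementation: the function name `channel_reweight`, the weight matrices `w1`/`w2`, and the reduction ratio are all assumptions made for clarity.

```python
import numpy as np

def channel_reweight(fused, w1, w2):
    """Hypothetical SE-style channel reweighting over fused features.

    fused : (C, H, W) fused geometry/appearance feature map
    w1    : (C // r, C) bottleneck projection (assumed learned)
    w2    : (C, C // r) expansion projection (assumed learned)
    """
    # Global average pool each channel to a single descriptor -> (C,)
    pooled = fused.mean(axis=(1, 2))
    # Bottleneck with ReLU -> (C // r,)
    hidden = np.maximum(w1 @ pooled, 0.0)
    # Sigmoid gates in (0, 1), one per channel -> (C,)
    gates = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))
    # Scale each channel by its gate, down-weighting unreliable channels
    return fused * gates[:, None, None]
```

Because the gates lie strictly in (0, 1), the module can only attenuate channels, which matches its stated role of suppressing unreliable geometry features after fusion.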