Monocular video human mesh recovery struggles to maintain metric consistency and temporal stability because depth is inherently ambiguous and scale is uncertain. Existing methods rely primarily on RGB features and temporal smoothing, and they suffer from incorrect depth ordering, scale drift, and occlusion-induced instability. We propose a depth-guided framework that achieves metric-aware temporal consistency through three complementary components: (1) a Depth-Guided Multi-Scale Fusion module that adaptively integrates geometric priors with RGB features via confidence-aware gating; (2) a Depth-Guided Metric-Aware Pose and Shape (D-MAPS) estimator that leverages depth-calibrated bone statistics for scale-consistent initialization; and (3) a Motion-Depth Aligned Refinement (MoDAR) module that enforces temporal coherence through cross-modal attention between motion dynamics and geometric cues. On three challenging benchmarks, our method improves robustness to heavy occlusion and spatial accuracy while maintaining computational efficiency.
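To make the idea of confidence-aware gating more concrete, the following is a minimal sketch of how depth features might be blended into RGB features at a single spatial location. The abstract does not specify the gate's form, so everything here is an assumption made for illustration: the function name `confidence_gated_fusion`, the learned parameters `gate_weights` and `gate_bias`, the scalar `depth_conf` in [0, 1], and the choice of a sigmoid gate scaled by that confidence.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real-valued logit to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def confidence_gated_fusion(rgb_feat, depth_feat, depth_conf,
                            gate_weights, gate_bias):
    """Blend depth features into RGB features for one spatial location.

    Hypothetical interface (not taken from the paper):
      rgb_feat, depth_feat: lists of C floats.
      depth_conf: float in [0, 1], reliability of the depth estimate.
      gate_weights: C rows of 2C floats (assumed learned).
      gate_bias: C floats (assumed learned).
    """
    # The gate sees both modalities, so it can adapt to their agreement.
    joint = list(rgb_feat) + list(depth_feat)
    fused = []
    for c, (r, d) in enumerate(zip(rgb_feat, depth_feat)):
        logit = sum(w * x for w, x in zip(gate_weights[c], joint)) + gate_bias[c]
        # Scale the gate by depth confidence: unreliable depth is suppressed.
        gate = depth_conf * sigmoid(logit)
        # Convex combination keeps each fused value between r and d.
        fused.append((1.0 - gate) * r + gate * d)
    return fused


# Usage with made-up numbers: two channels, one location.
rgb = [0.5, -1.0]
depth = [2.0, 1.0]
weights = [[0.1, 0.2, -0.3, 0.4], [0.0, 0.5, 0.5, -0.2]]
bias = [0.0, 0.0]
fused_trusted = confidence_gated_fusion(rgb, depth, 1.0, weights, bias)
fused_untrusted = confidence_gated_fusion(rgb, depth, 0.0, weights, bias)
```

With `depth_conf` set to 0 the gate closes and the output falls back to the RGB features, which is one plausible way such a design could limit the effect of unreliable depth under occlusion. A multi-scale version would presumably apply the same operation at each feature resolution.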