We introduce VideoMDM, a diffusion-based framework that learns 3D human motion priors directly from accurate 2D poses extracted from monocular videos, without any 3D ground truth. A pretrained 2D-to-3D lifter provides approximate 3D pose sequences that serve as a noisy teacher: these sequences are diffused, denoised by the model in 3D, and supervised in 2D by reprojecting the prediction and comparing it against the accurate 2D keypoints. We show that, under mild assumptions, a depth-weighted 2D reprojection loss is equivalent in expectation to direct 3D supervision, and we adapt two standard 3D motion regularizers, velocity consistency and over-parameterized representation alignment, to this 2D setting. Unlike methods that lift 2D to 3D only at inference, VideoMDM learns a coherent 3D motion manifold during training. On HumanML3D, it nearly closes the gap to fully 3D-supervised MDM (FID 0.88 vs. 0.54). On the real-video datasets Fit3D and NBA, it learns to generate motions that humans consistently prefer, with strong quantitative results.
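The intuition behind the depth-weighted reprojection loss can be illustrated with a minimal sketch. It assumes a pinhole camera with focal length `focal` and camera-space joints with positive depth; the function names and the specific weighting (depth divided by focal length) are illustrative assumptions, not the formulation used in the paper. Under perspective projection a lateral 3D offset shrinks by a factor of `focal / z` in the image, so scaling the 2D residual by `z / focal` recovers the 3D lateral error when predicted and true depths agree.

```python
import numpy as np


def project(points_3d, focal=1.0):
    """Pinhole projection of (..., 3) camera-space points to (..., 2) image coordinates."""
    z = points_3d[..., 2:3]
    return focal * points_3d[..., :2] / z


def depth_weighted_reprojection_loss(pred_3d, target_2d, focal=1.0):
    """Squared 2D reprojection error, with each joint's residual scaled by depth / focal.

    Hypothetical weighting chosen for illustration: it undoes the perspective
    shrinkage so that the 2D residual is measured in 3D units.
    """
    z = pred_3d[..., 2:3]
    residual_2d = project(pred_3d, focal) - target_2d
    weighted = (z / focal) * residual_2d
    return np.mean(np.sum(weighted ** 2, axis=-1))


# Usage: a toy "motion" of 4 frames x 17 joints placed 2 to 5 units from the camera.
rng = np.random.default_rng(0)
gt_3d = np.concatenate(
    [rng.normal(size=(4, 17, 2)), rng.uniform(2.0, 5.0, size=(4, 17, 1))], axis=-1
)
pred_3d = gt_3d.copy()
pred_3d[..., :2] += 0.05 * rng.normal(size=(4, 17, 2))  # lateral error only

loss_2d = depth_weighted_reprojection_loss(pred_3d, project(gt_3d))
loss_3d = np.mean(np.sum((pred_3d[..., :2] - gt_3d[..., :2]) ** 2, axis=-1))
print(loss_2d, loss_3d)
```

In this toy case the two losses coincide because the predicted depth is exact. With depth errors the match is only approximate, which is presumably where the "mild assumptions" and the "in expectation" qualifier come in.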