Learning to Control Physically-simulated 3D Characters via Generating and Mimicking 2D Motions

Video data is more cost-effective than motion capture data for learning 3D character motion controllers, yet synthesizing realistic and diverse behaviors directly from videos remains challenging. Previous approaches typically rely on off-the-shelf motion reconstruction techniques to obtain 3D trajectories for physics-based imitation. These reconstruction methods generalize poorly: they either require 3D training data, which can be scarce, or fail to produce physically plausible poses, which hinders their application to challenging scenarios such as human-object interaction (HOI) or non-human characters. We tackle this challenge by introducing Mimic2DM, a novel motion imitation framework that learns the control policy directly and solely from widely available 2D keypoint trajectories extracted from videos. By minimizing the reprojection error, we train a general single-view 2D motion tracking policy capable of following arbitrary 2D reference motions in physics simulation, using only 2D motion data. When trained on diverse 2D motions captured from different, or even only slightly different, viewpoints, the policy can further acquire 3D motion tracking capabilities by aggregating multiple views. Moreover, we develop a transformer-based autoregressive 2D motion generator and integrate it into a hierarchical control framework, where the generator produces high-quality 2D reference trajectories to guide the tracking policy. We show that the proposed approach is versatile and can effectively learn to synthesize physically plausible and diverse motions across a range of domains, including dancing, soccer dribbling, and animal movements, without any reliance on explicit 3D motion data. Project Website: https://jiann-li.github.io/mimic2dm/
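
To make the reprojection objective mentioned above concrete, the following is a minimal sketch of how a 2D tracking error could be computed from simulated 3D joint positions. It is an illustration under stated assumptions, not the method used in Mimic2DM: the pinhole camera model, the function names (`project_points`, `reprojection_error`, `tracking_reward`), the camera parameters, and the exponential reward shape with its `scale` value are all hypothetical choices made for this example.

```python
import math


def project_points(points_3d, focal_length=1.0, principal_point=(0.0, 0.0)):
    """Project camera-frame 3D points (x, y, z) onto the image plane.

    Assumes a simple pinhole camera looking along +z (an illustrative choice).
    """
    cx, cy = principal_point
    projected = []
    for x, y, z in points_3d:
        if z <= 0:
            raise ValueError("points must lie in front of the camera (z > 0)")
        projected.append((focal_length * x / z + cx, focal_length * y / z + cy))
    return projected


def reprojection_error(joints_3d, keypoints_2d, **camera):
    """Mean Euclidean distance between projected joints and 2D reference keypoints."""
    projected = project_points(joints_3d, **camera)
    distances = [math.dist(p, q) for p, q in zip(projected, keypoints_2d)]
    return sum(distances) / len(distances)


def tracking_reward(error, scale=5.0):
    """Map an error to a reward in (0, 1]; the exponential form is an assumption."""
    return math.exp(-scale * error)


# Two simulated joints in the camera frame and a matching 2D reference.
joints = [(0.0, 0.0, 2.0), (1.0, 1.0, 2.0)]
reference = project_points(joints)

# A reference shifted by (0.3, 0.4) in the image is 0.5 units away per joint.
shifted = [(u + 0.3, v + 0.4) for u, v in reference]

perfect_error = reprojection_error(joints, reference)
shifted_error = reprojection_error(joints, shifted)
```

In a reinforcement learning setup, a quantity like `tracking_reward(shifted_error)` could serve as a per-step reward, so that lowering the reprojection error raises the return. Note that a single view constrains only the image-plane position of each joint and leaves depth ambiguous, which is consistent with the abstract's point that 3D tracking emerges from aggregating multiple views.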
