Humanoid motion control has witnessed significant breakthroughs in recent years, with deep reinforcement learning (RL) emerging as a primary catalyst for achieving complex, human-like behaviors. However, the high dimensionality and intricate dynamics of humanoid robots make manual motion design impractical, forcing a heavy reliance on expensive motion capture (MoCap) data. These datasets are not only costly to acquire but also frequently lack the geometric context of the surrounding physical environment. As a result, existing motion synthesis frameworks often decouple motion from scene, producing physical inconsistencies such as contact slippage or mesh penetration in terrain-aware tasks. In this work, we present MeshMimic, a framework that bridges 3D scene reconstruction and embodied intelligence, enabling humanoid robots to learn coupled "motion-terrain" interactions directly from video. Leveraging state-of-the-art 3D vision models, our framework precisely segments and reconstructs both human trajectories and the underlying 3D geometry of terrain and objects. We introduce an optimization algorithm based on kinematic consistency that extracts high-quality motion data from noisy visual reconstructions, together with a contact-invariant retargeting method that transfers human-environment interaction features to the humanoid agent. Experimental results demonstrate that MeshMimic achieves robust, highly dynamic performance across diverse and challenging terrains. Our approach shows that a low-cost pipeline using only consumer-grade monocular sensors can support the training of complex physical interactions, offering a scalable path toward the autonomous evolution of humanoid robots in unstructured environments.