Recent advances in video generative models enable the synthesis of realistic human-object interaction videos across a wide range of scenarios and object categories, including complex dexterous manipulations that are difficult to capture with motion capture systems. While the rich interaction knowledge embedded in these synthetic videos holds strong potential for motion planning in dexterous robotic manipulation, their limited physical fidelity and purely 2D nature make them difficult to use directly as imitation targets in physics-based character control. We present DeVI (Dexterous Video Imitation), a novel framework that leverages text-conditioned synthetic videos to enable physically plausible dexterous control of an agent interacting with unseen target objects. To overcome the imprecision of generative 2D cues, we introduce a hybrid tracking reward that integrates 3D human tracking with robust 2D object tracking. Unlike methods that rely on high-quality 3D kinematic demonstrations, DeVI requires only the generated video, enabling zero-shot generalization across diverse objects and interaction types. Extensive experiments demonstrate that DeVI outperforms existing approaches that imitate 3D human-object interaction demonstrations, particularly in modeling dexterous hand-object interactions. We further validate DeVI on multi-object scenes and on text-driven action diversity, showcasing the advantage of using video as an HOI-aware motion planner.
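To make the hybrid tracking reward concrete, the sketch below shows one plausible way to combine a 3D human-tracking term with a 2D object-tracking term. This is a minimal illustration, not the paper's implementation: the function name `hybrid_tracking_reward`, the weighting `w_human`, the error scales, and the negative-exponential shaping are all assumptions; the abstract only states that the two tracking signals are integrated into a single reward.

```python
import numpy as np

def hybrid_tracking_reward(
    sim_joints_3d,      # (J, 3) joint positions of the simulated character
    ref_joints_3d,      # (J, 3) 3D human pose recovered from the generated video
    sim_obj_center_2d,  # (2,) simulated object center projected into the video camera
    ref_obj_center_2d,  # (2,) object center tracked in 2D in the generated video
    w_human=0.7,        # hypothetical weight between the human and object terms
    sigma_human=0.1,    # hypothetical error scale for the 3D human term (meters)
    sigma_obj=0.05,     # hypothetical error scale for the 2D object term (normalized pixels)
):
    """Weighted combination of a 3D human-tracking reward and a 2D object-tracking reward.

    Each term maps a tracking error to (0, 1] via a negative exponential, a common
    shaping choice in physics-based character control; the actual terms used by
    DeVI may differ.
    """
    human_err = np.mean(np.linalg.norm(sim_joints_3d - ref_joints_3d, axis=-1))
    obj_err = np.linalg.norm(sim_obj_center_2d - ref_obj_center_2d)

    r_human = np.exp(-(human_err / sigma_human) ** 2)
    r_obj = np.exp(-(obj_err / sigma_obj) ** 2)
    return w_human * r_human + (1.0 - w_human) * r_obj
```

Under these assumptions, the reward stays high only when the character matches the recovered 3D body motion and the manipulated object stays aligned with its 2D track in the video, which is one way a policy could be steered by imprecise generative cues without requiring 3D object ground truth.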