We present Actron3D, a framework that enables robots to acquire transferable 6-DoF manipulation skills from just a few monocular, uncalibrated, RGB-only human videos. At its core lies the Neural Affordance Function, a compact object-centric representation that distills actionable cues (geometry, visual appearance, and affordance) from diverse uncalibrated videos into a lightweight neural network, forming a memory bank of manipulation skills. During deployment, we adopt a pipeline that retrieves relevant affordance functions and transfers precise 6-DoF manipulation policies via coarse-to-fine optimization, enabled by continuous queries to the multimodal features encoded in the neural functions. Experiments in both simulation and the real world demonstrate that Actron3D significantly outperforms prior methods, achieving a 14.9 percentage point improvement in average success rate across 13 tasks while requiring only 2-3 demonstration videos per task.
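To make the continuous-query idea concrete, the sketch below is a minimal PyTorch illustration under assumed design choices, not the paper's actual architecture: it models a Neural Affordance Function as a small coordinate MLP with separate geometry, appearance, and affordance heads, and runs a coarse-to-fine query in which random coarse samples seed a gradient-based local refinement of the best candidate. The class and function names (`NeuralAffordanceFunction`, `coarse_to_fine_query`), the head outputs, and the refinement scheme are all hypothetical; in the full system the refined point would seed a 6-DoF gripper pose rather than a bare 3D location.

```python
import torch
import torch.nn as nn

class NeuralAffordanceFunction(nn.Module):
    """Hypothetical sketch: a lightweight MLP mapping a 3D query point
    to multimodal outputs (geometry, appearance feature, affordance)."""
    def __init__(self, feat_dim=32, hidden=128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.geometry_head = nn.Linear(hidden, 1)        # e.g. occupancy / signed distance
        self.appearance_head = nn.Linear(hidden, feat_dim)  # distilled visual feature
        self.affordance_head = nn.Linear(hidden, 1)      # contact / graspability score

    def forward(self, xyz):
        h = self.backbone(xyz)
        return (self.geometry_head(h),
                self.appearance_head(h),
                torch.sigmoid(self.affordance_head(h)))

def coarse_to_fine_query(naf, n_coarse=4096, n_steps=50, lr=1e-2):
    """Coarse stage: score uniform samples and keep the best candidate.
    Fine stage: refine it by gradient ascent on the affordance score,
    exploiting the fact that the neural function is continuously queryable."""
    pts = torch.rand(n_coarse, 3) * 2 - 1                # coarse samples in [-1, 1]^3
    with torch.no_grad():
        _, _, aff = naf(pts)
    best = pts[aff.squeeze(-1).argmax()].clone().requires_grad_(True)
    opt = torch.optim.Adam([best], lr=lr)
    for _ in range(n_steps):                             # continuous local refinement
        opt.zero_grad()
        _, _, aff = naf(best.unsqueeze(0))
        (-aff).sum().backward()                          # ascend the affordance score
        opt.step()
    return best.detach()

# Usage: query a (randomly initialized) affordance function for a contact point.
contact_point = coarse_to_fine_query(NeuralAffordanceFunction())
```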