In this work, we introduce the Virtual In-Hand Eye Transformer (VIHE), a novel method designed to enhance 3D manipulation capabilities through action-aware view rendering. VIHE autoregressively refines actions over multiple stages, conditioning on virtual views rendered at the poses predicted in earlier stages. These virtual in-hand views provide a strong inductive bias for recognizing the correct hand pose, especially in challenging high-precision tasks such as peg insertion. On 18 manipulation tasks in RLBench simulated environments, VIHE achieves a new state of the art with 100 demonstrations per task, improving the success rate of the best existing model from 65% to 77%, a 12% absolute gain. In real-world scenarios, VIHE can learn manipulation tasks from just a handful of demonstrations, highlighting its practical utility. Videos and code can be found at our project site: https://vihe-3d.github.io.
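The staged refinement loop described above can be sketched in a few lines. This is a minimal, hypothetical illustration only: the function names (`render_view`, `predict_action`) and the simple pose-nudging stand-ins are placeholders, not the authors' actual rendering pipeline or transformer policy.

```python
def render_view(scene, camera_pose):
    # Stand-in: a real system would render an RGB-D "in-hand" view
    # from a virtual camera posed at the current action estimate.
    return {"scene": scene, "camera": camera_pose}

def predict_action(view, pose):
    # Stand-in: a real transformer would predict a pose correction from
    # the rendered view. Here we nudge the pose toward a known target
    # purely for illustration.
    target = view["scene"]["target"]
    return [p + 0.5 * (t - p) for p, t in zip(pose, target)]

def refine_action(scene, initial_pose, num_stages=3):
    """Autoregressive refinement: each stage renders a virtual view posed
    at the previous stage's prediction and conditions the next prediction
    on that view."""
    pose = initial_pose
    for _ in range(num_stages):
        view = render_view(scene, camera_pose=pose)
        pose = predict_action(view, pose)
    return pose
```

Because each stage re-renders from the latest estimate, later stages see the workspace from progressively closer, better-aligned viewpoints, which is the inductive bias that helps with high-precision placement.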