In this work, we introduce the Virtual In-Hand Eye Transformer (VIHE), a novel method that enhances 3D manipulation capabilities through action-aware view rendering. VIHE refines actions autoregressively over multiple stages: each stage conditions on views rendered from virtual cameras whose poses are given by the action predictions of the earlier stages. These virtual in-hand views provide a strong inductive bias for recognizing the correct hand pose, especially in challenging high-precision tasks such as peg insertion. On 18 manipulation tasks in RLBench simulated environments, VIHE achieves a new state of the art, improving on the existing state-of-the-art model by 12 percentage points in absolute terms (from 65% to 77%) with 100 demonstrations per task. In real-world scenarios, VIHE can learn manipulation tasks from just a handful of demonstrations, highlighting its practical utility. Videos and code are available at our project site: https://vihe-3d.github.io.
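To make the multi-stage refinement idea concrete, the following is a minimal toy sketch, not the VIHE implementation. It assumes a point-cloud scene, position-only actions, and two hypothetical stand-ins: `render_in_hand_view` replaces image rendering with a spherical crop expressed in the frame of the current prediction, and `predict_residual` replaces the learned transformer with a simple centroid estimate. The function names, the radii, and the toy scene are all illustrative assumptions.

```python
import numpy as np


def render_in_hand_view(points, position, radius):
    """Hypothetical stand-in for rendering a virtual in-hand view.

    Returns the scene points that lie within `radius` of the currently
    predicted position, expressed relative to that position.
    """
    offsets = points - position
    visible = np.linalg.norm(offsets, axis=1) <= radius
    return offsets[visible]


def predict_residual(local_points):
    """Hypothetical stand-in for the learned predictor.

    Assumes the target sits at the centroid of whatever the view contains,
    so the predicted correction is the mean offset. An empty view yields
    no correction.
    """
    if len(local_points) == 0:
        return np.zeros(3)
    return local_points.mean(axis=0)


def refine_action(points, initial_position, radii):
    """Refine a position estimate over several stages.

    Each stage builds a view centered on the previous prediction and adds
    the residual predicted from that view. Shrinking radii mimic
    progressively tighter views around the hand.
    """
    position = np.asarray(initial_position, dtype=float)
    history = [position.copy()]
    for radius in radii:
        view = render_in_hand_view(points, position, radius)
        position = position + predict_residual(view)
        history.append(position.copy())
    return position, history


def make_toy_scene():
    """Illustrative scene: a small symmetric cluster at the target plus clutter."""
    target = np.array([0.5, 0.2, 0.3])
    cluster = np.vstack([target + 0.01 * np.eye(3), target - 0.01 * np.eye(3)])
    clutter = np.vstack([target + np.array([0.25, 0.0, 0.0]), [2.0, 2.0, 2.0]])
    return np.vstack([cluster, clutter]), target


if __name__ == "__main__":
    points, target = make_toy_scene()
    coarse_guess = target + np.array([0.1, -0.05, 0.08])
    final, history = refine_action(points, coarse_guess, radii=[0.3, 0.15, 0.05])
    for stage, estimate in enumerate(history):
        print(stage, np.linalg.norm(estimate - target))
```

In this toy scene the widest view still contains a nearby distractor point, so the first correction is biased; the tighter view centered on the improved estimate excludes the distractor and removes the remaining error. This is intended only as an intuition for why views posed from earlier predictions can help at high precision.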