This paper presents DNAct, a language-conditioned multi-task policy framework that integrates neural rendering pre-training and diffusion training to enable multi-modal learning in action sequence spaces. To learn a generalizable multi-task policy from few demonstrations, the pre-training phase of DNAct leverages neural rendering to distill 2D semantic features from foundation models such as Stable Diffusion into a 3D space, providing a comprehensive semantic understanding of the scene. This enables application to challenging robotic tasks that require rich 3D semantics and accurate geometry. Furthermore, we introduce a novel approach that uses diffusion training to learn a vision-language feature encapsulating the multi-modality inherent in multi-task demonstrations. By reconstructing action sequences from different tasks via the diffusion process, the model learns to distinguish different modalities, which improves the robustness and generalizability of the learned representation. DNAct significantly surpasses state-of-the-art NeRF-based multi-task manipulation approaches, with over a 30% improvement in success rate. Project website: dnact.github.io.