We present FoundationPose, a unified foundation model for 6D object pose estimation and tracking, supporting both model-based and model-free setups. Our approach can be instantly applied at test time to a novel object without fine-tuning, as long as its CAD model is given or a small number of reference images are captured. We bridge the gap between these two setups with a neural implicit representation that allows for effective novel view synthesis, keeping the downstream pose estimation modules invariant under the same unified framework. Strong generalizability is achieved via large-scale synthetic training, aided by a large language model (LLM), a novel transformer-based architecture, and a contrastive learning formulation. Extensive evaluation on multiple public datasets involving challenging scenarios and objects indicates that our unified approach outperforms existing methods specialized for each task by a large margin. In addition, it achieves results comparable to instance-level methods despite the reduced assumptions. Project page: https://nvlabs.github.io/FoundationPose/