Operating effectively in novel real-world environments requires robotic systems to perceive and interact with previously unseen objects. Current state-of-the-art models address this challenge by using large amounts of training data and test-time samples to build black-box scene representations. In this work, we introduce a differentiable neuro-graphics model that combines neural foundation models with physics-based differentiable rendering to perform zero-shot scene reconstruction and robot grasping without relying on any additional 3D data or test-time samples. Our model solves a series of constrained optimization problems to estimate physically consistent scene parameters, such as meshes, lighting conditions, material properties, and 6D poses of previously unseen objects, from a single RGBD image and bounding boxes. We evaluated our approach on standard model-free few-shot pose-estimation benchmarks, where it outperforms existing algorithms. Furthermore, we validated the accuracy of our scene reconstructions by applying our algorithm to a zero-shot grasping task. By enabling zero-shot, physically consistent scene reconstruction and grasping without reliance on extensive datasets or test-time sampling, our approach offers a pathway towards more data-efficient, interpretable, and generalizable robot autonomy in novel environments.
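To give intuition for the render-and-compare optimization the abstract describes, the following is a minimal, purely illustrative sketch: a "renderer" here is just a rigid translation of toy model points, and gradient descent on a squared reprojection error recovers the pose offset. The paper's actual method uses a physics-based differentiable renderer over meshes, lighting, and materials; nothing below is from the paper's implementation.

```python
import numpy as np

# Toy render-and-compare pose optimization (illustrative only; the paper's
# method optimizes full scene parameters through a differentiable renderer).
model = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])  # toy "mesh" points
true_t = np.array([0.5, -0.3])                           # ground-truth pose offset
observed = model + true_t                                # simulated observation

t = np.zeros(2)   # pose estimate, initialized at identity
lr = 0.1          # gradient-descent step size
for _ in range(200):
    rendered = model + t                 # "render" the model at the current pose
    residual = rendered - observed       # per-point error against the observation
    grad = 2.0 * residual.mean(axis=0)   # analytic gradient of the mean squared error
    t -= lr * grad                       # gradient step toward the observed pose
```

In the real system, the same principle applies but the gradients flow through the rendering equation itself, so lighting and material parameters can be optimized jointly with pose.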