Learning a precise robotic grasping policy is crucial for embodied agents operating in complex real-world manipulation tasks. Despite significant advances, most models still struggle to accurately localize the objects to be grasped. We first show that this spatial generalization challenge stems primarily from the large amount of data needed for adequate spatial understanding. However, collecting such data with real robots is prohibitively expensive, and relying on simulation data often leads to visual generalization gaps upon deployment. To overcome these challenges, we focus on state-based policy generalization and present \textbf{ManiBox}, a novel bounding-box-guided manipulation method built on a simulation-based teacher-student framework. The teacher policy efficiently generates scalable simulation data using bounding boxes, which are proven to uniquely determine the objects' spatial positions. The student policy then uses these low-dimensional spatial states to enable zero-shot transfer to real robots. Through comprehensive evaluations in simulated and real-world environments, ManiBox demonstrates a marked improvement in spatial grasping generalization and in adaptability to diverse objects and backgrounds. Further, our empirical study of scaling laws for policy performance indicates that spatial volume generalization follows a power law in data volume. For a fixed spatial volume, the grasping success rate empirically follows Michaelis-Menten kinetics with respect to data volume, saturating as data increases. Our videos and code are available at https://thkkk.github.io/manibox.
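The two scaling relations named above can be illustrated with a minimal sketch. It assumes the standard Michaelis-Menten form, success = s_max * D / (k_half + D), and a generic power law D = coeff * V ** exponent. The function names, parameter names, and all numeric values are hypothetical and are not taken from the paper; the fitted constants and the exact variables of the power law should be read from the paper itself.

```python
import math


def michaelis_menten(data_volume: float, s_max: float, k_half: float) -> float:
    """Saturating success rate: s_max * D / (k_half + D).

    s_max is the asymptotic success rate and k_half is the data volume at
    which the success rate reaches s_max / 2. Both are hypothetical
    parameters here, not values reported in the paper.
    """
    if data_volume < 0 or k_half <= 0:
        raise ValueError("data_volume must be >= 0 and k_half must be > 0")
    return s_max * data_volume / (k_half + data_volume)


def power_law(spatial_volume: float, coeff: float, exponent: float) -> float:
    """Generic power law: coeff * V ** exponent.

    Assumed here to map a spatial volume V to a data volume D. Because a
    power law is invertible, the reverse mapping is a power law as well.
    """
    if spatial_volume < 0:
        raise ValueError("spatial_volume must be >= 0")
    return coeff * spatial_volume ** exponent


if __name__ == "__main__":
    # Illustrative numbers only: the success rate climbs quickly, then saturates.
    for d in (0, 100, 1_000, 10_000, 100_000):
        rate = michaelis_menten(d, s_max=0.9, k_half=1_000)
        print(f"D={d:>7}: success ~ {rate:.3f}")
```

In this form the success rate is exactly half of `s_max` when the data volume equals `k_half`, and each further increase in data brings a smaller gain, which is the saturation effect described in the abstract.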