Diffusion-based policies show limited generalization in semantic manipulation, posing a key obstacle to the deployment of real-world robots. This limitation arises because text instructions alone are inadequate for directing the policy's attention toward the target object in complex and dynamic environments. To solve this problem, we propose leveraging bounding-box instructions to directly specify the target object, and we further investigate whether data scaling laws exist in semantic manipulation tasks. Specifically, we design a handheld segmentation device with an automated annotation pipeline, Label-UMI, which enables the efficient collection of demonstration data with semantic labels. We further propose a semantic-motion-decoupled framework that integrates object detection and a bounding-box-guided diffusion policy to improve generalization and adaptability in semantic manipulation. Through extensive real-world experiments on large-scale datasets, we validate the effectiveness of the approach and reveal a power-law relationship between generalization performance and the number of bounding-box objects. Finally, we summarize an effective data collection strategy for semantic manipulation, which can achieve success rates of 85\% across four tasks on both seen and unseen objects. All datasets and code will be released to the community.