We are surrounded by objects with movable, articulated parts, e.g., boxes, handles, and doors. Accurate and generalizable perception of articulated parts is essential for enhancing robotic manipulation capabilities. Recent efforts in articulated part perception have followed two main directions. One line of work uses pose-based representations, which require substantial manual effort; in parallel, affordance-based methods extract future object motion from point tracking without additional manual effort, but suffer from low-quality data. In this paper, we propose a new representation of articulated parts, Geometric Primary Structure (GPS), an abstraction of part geometry that balances scalability and quality. For efficient and scalable data collection, GPS is integrated with a portable Virtual Reality (VR) device and requires only one minute to annotate one object sequence. This direct human annotation provides higher quality than estimated affordance. With this efficient VR-GPS system, we collect 41K frames for 234 objects across six part classes, and train a generalizable GPS model that takes a single RGB-D object image as input. For object manipulation, we deploy a heuristic policy based on GPS predictions. Without any in-domain fine-tuning, our method achieves a 73% success rate across 270 initial states of 9 objects. Our code, data, and reusable tool are available at https://enlighten0707.github.io/gps.