Object pose estimation plays a vital role in mixed-reality interactions when users manipulate tangible objects as controllers. Traditional vision-based pose estimation methods leverage 3D reconstruction to synthesize training data. However, these methods are designed for static objects with diffuse colors and do not work well for objects whose appearance changes during manipulation, such as deformable objects like plush toys, transparent objects like chemical flasks, reflective objects like metal pitchers, and articulated objects like scissors. To address this limitation, we propose Rocap, a robotic pipeline that emulates human manipulation of target objects while generating data labeled with ground-truth pose information. The user first hands the target object to a robotic arm, and the system captures many images of the object in various 6D poses. The system then trains a model on the captured images and their ground-truth poses, computed automatically from the joint angles of the robotic arm. We showcase pose estimation for appearance-changing objects by training simple deep-learning models on the collected data and comparing them, through quantitative and qualitative evaluation, against a model trained on synthetic data based on 3D reconstruction. The findings underscore the promising capabilities of Rocap.
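The ground-truth pose computation mentioned above amounts to standard forward kinematics: chaining one homogeneous transform per joint to obtain the gripped object's 6D pose in the robot base frame. The sketch below illustrates the idea with Denavit-Hartenberg parameters; the arm geometry (`DH`) and joint angles are hypothetical placeholders, not the arm used in the paper.

```python
import numpy as np

def dh_transform(theta, d, a, alpha):
    """Homogeneous transform for one joint from Denavit-Hartenberg parameters."""
    ct, st = np.cos(theta), np.sin(theta)
    ca, sa = np.cos(alpha), np.sin(alpha)
    return np.array([
        [ct, -st * ca,  st * sa, a * ct],
        [st,  ct * ca, -ct * sa, a * st],
        [0.0,      sa,       ca,      d],
        [0.0,     0.0,      0.0,    1.0],
    ])

def forward_kinematics(joint_angles, dh_params):
    """Chain per-joint transforms to get the end-effector (and hence the
    gripped object's) pose in the base frame as a 4x4 homogeneous matrix."""
    T = np.eye(4)
    for theta, (d, a, alpha) in zip(joint_angles, dh_params):
        T = T @ dh_transform(theta, d, a, alpha)
    return T

# Hypothetical 3-joint arm; link offsets/lengths are made up for illustration.
DH = [(0.3, 0.0, np.pi / 2), (0.0, 0.4, 0.0), (0.0, 0.3, 0.0)]
pose = forward_kinematics([0.1, -0.5, 0.8], DH)
position = pose[:3, 3]   # object translation in the base frame
rotation = pose[:3, :3]  # object orientation as a rotation matrix
```

A real pipeline would additionally compose a fixed gripper-to-object offset and a calibrated base-to-camera transform to express the pose in the camera frame used for the training labels.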