This paper evaluates and benchmarks the robustness of visual representations in object assembly tasks. Specifically, it investigates the alignment and insertion of objects with geometrical extrusions and intrusions, commonly referred to as a peg-in-hole task. Successful assembly requires detecting and orienting the peg and hole geometry in SE(3) with high accuracy, which poses significant challenges. To address this, we adopt a general visuomotor policy learning framework that uses pretrained visual models as vision encoders. We study the robustness of this framework in a dual-arm manipulation setup, in particular its robustness to grasp variations. Our quantitative analysis shows that existing pretrained models fail to capture the visual features essential to this task. In contrast, a visual encoder trained from scratch consistently outperforms the frozen pretrained models. Moreover, we discuss rotation representations and associated loss functions that substantially improve policy learning. We present a novel task scenario designed to evaluate progress in visuomotor policy learning, with a specific focus on improving robustness in intricate assembly tasks that require both geometrical and spatial reasoning. Videos, additional experiments, dataset, and code are available at https://bit.ly/geometric-peg-in-hole .
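The abstract does not say which rotation representations or loss functions are compared. As a purely illustrative sketch, assume a continuous 6D representation (two 3-vectors orthonormalized by Gram-Schmidt) paired with a geodesic-angle loss on SO(3). The names `rotation_from_6d`, `geodesic_distance`, and `rot_z` are hypothetical and are not taken from the paper's code.

```python
import numpy as np


def rotation_from_6d(x):
    """Map a 6-vector to a 3x3 rotation matrix via Gram-Schmidt.

    Assumption: the first three entries and the last three entries are two
    non-parallel, non-zero 3-vectors (e.g. raw outputs of a policy head).
    """
    x = np.asarray(x, dtype=float)
    a1, a2 = x[:3], x[3:]
    b1 = a1 / np.linalg.norm(a1)            # first basis vector
    a2_orth = a2 - np.dot(b1, a2) * b1      # remove the component along b1
    b2 = a2_orth / np.linalg.norm(a2_orth)  # second basis vector
    b3 = np.cross(b1, b2)                   # third vector completes a right-handed frame
    return np.stack([b1, b2, b3], axis=1)   # basis vectors as columns


def geodesic_distance(R_pred, R_true):
    """Angle in radians of the relative rotation R_pred^T R_true."""
    cos_angle = (np.trace(R_pred.T @ R_true) - 1.0) / 2.0
    # Clip to guard against values slightly outside [-1, 1] from round-off.
    return float(np.arccos(np.clip(cos_angle, -1.0, 1.0)))


def rot_z(theta):
    """Rotation by theta radians about the z-axis (helper for the example)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


if __name__ == "__main__":
    R = rotation_from_6d([1.0, 0.2, 0.0, 0.1, 1.0, 0.3])
    print("orthonormal:", np.allclose(R.T @ R, np.eye(3)))
    print("angle between rot_z(0.3) and rot_z(0.8):",
          geodesic_distance(rot_z(0.3), rot_z(0.8)))
```

One reason a representation like this is often considered for regression targets is that any pair of non-parallel vectors maps to a valid rotation, so the network output needs no unit-norm or orthogonality constraint. Whether this matches the representations evaluated in the paper is not stated in the abstract.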