This paper evaluates and benchmarks the robustness of visual representations for object assembly tasks. Specifically, it investigates the alignment and insertion of objects with geometric extrusions and intrusions, commonly referred to as a peg-in-hole task. Successful assembly requires detecting and orienting the peg and hole geometry in SE(3) with high accuracy, which poses significant challenges. To address this, we adopt a general visuomotor policy learning framework that uses pretrained vision models as visual encoders. We study the robustness of this framework in a dual-arm manipulation setup, in particular its robustness to grasp variations. Our quantitative analysis shows that existing pretrained models fail to capture the visual features essential for this task, whereas a visual encoder trained from scratch consistently outperforms the frozen pretrained models. Moreover, we discuss rotation representations and associated loss functions that substantially improve policy learning. We present a novel task scenario designed to evaluate progress in visuomotor policy learning, with a specific focus on improving robustness in intricate assembly tasks that require both geometric and spatial reasoning. Videos, additional experiments, dataset, and code are available at https://bit.ly/geometric-peg-in-hole .