This paper evaluates and benchmarks the robustness of visual representations in object assembly tasks. Specifically, it investigates the alignment and insertion of objects with geometric extrusions and intrusions, commonly referred to as a peg-in-hole task. Successful assembly requires detecting and orienting the peg and hole geometry in SE(3) with high accuracy, which poses significant challenges. To address this, we employ a general visuomotor policy learning framework that uses pretrained visual models as vision encoders. We study the robustness of this framework in a dual-arm manipulation setup, specifically its robustness to grasp variations. Our quantitative analysis shows that existing pretrained models fail to capture the visual features essential for this task, whereas a visual encoder trained from scratch consistently outperforms the frozen pretrained models. Moreover, we discuss rotation representations and associated loss functions that substantially improve policy learning. We present a novel task scenario designed to evaluate progress in visuomotor policy learning, with a specific focus on improving robustness in intricate assembly tasks that require both geometric and spatial reasoning. Videos, additional experiments, dataset, and code are available at https://bit.ly/geometric-peg-in-hole.
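The abstract does not say which rotation representations or loss functions are compared. As an illustration of the kind of choice involved, the sketch below assumes one commonly used pairing: a continuous 6D rotation representation, mapped to a rotation matrix by Gram-Schmidt orthogonalization, and a geodesic distance on SO(3) as the error measure. The function names are hypothetical and are not taken from the authors' code.

```python
import numpy as np

def rotation_6d_to_matrix(d6):
    """Map a 6D vector to a 3x3 rotation matrix via Gram-Schmidt.

    Assumption: the 6D vector holds two (not necessarily orthonormal)
    3D vectors that become the first two columns of the rotation.
    """
    d6 = np.asarray(d6, dtype=float)
    a1, a2 = d6[:3], d6[3:]
    b1 = a1 / np.linalg.norm(a1)
    # Remove the component of a2 along b1, then normalize.
    a2_orth = a2 - np.dot(b1, a2) * b1
    b2 = a2_orth / np.linalg.norm(a2_orth)
    # The third column completes a right-handed orthonormal frame.
    b3 = np.cross(b1, b2)
    return np.stack([b1, b2, b3], axis=1)

def geodesic_distance(R_pred, R_true):
    """Angle in radians of the relative rotation between two matrices."""
    cos_angle = (np.trace(R_pred.T @ R_true) - 1.0) / 2.0
    # Clip to guard against values slightly outside [-1, 1] from round-off.
    return float(np.arccos(np.clip(cos_angle, -1.0, 1.0)))

# Usage: a 90-degree rotation about the z-axis, compared with the identity.
R = rotation_6d_to_matrix([0.0, 1.0, 0.0, -1.0, 0.0, 0.0])
angle = geodesic_distance(np.eye(3), R)
```

A policy that outputs six unconstrained numbers can be mapped to a valid rotation this way, and the geodesic angle measures orientation error directly in SO(3) rather than in the parameter space.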