An end-to-end (E2E) visuomotor policy is typically treated as a unified whole, but recent approaches that pretrain the visual encoder on out-of-domain (OOD) data cleanly separate the visual encoder from the rest of the network, which is then referred to as the policy. We propose Visual Alignment Testing, an experimental framework designed to evaluate the validity of this functional separation. Our results indicate that in E2E-trained models, the visual encoder actively contributes to decision-making, a capability that arises from motor data supervision and contradicts the assumed functional separation. In contrast, OOD-pretrained models, whose encoders lack this capability, show an average performance drop of 42% on our benchmark relative to the state-of-the-art performance achieved by E2E policies. We believe this initial exploration of the visual encoder's role is a first step towards guiding future pretraining methods to address the encoder's decision-making ability, for example by developing task-conditioned or context-aware encoders.