Vision Transformers (ViTs), when paired with large-scale pretraining, have shown remarkable performance across a wide range of computer vision tasks, largely owing to their weak inductive bias. However, while this weak inductive bias aids pretraining scalability, it may also hinder the effective adaptation of ViTs to visuo-motor control tasks, because ViTs lack control-centric inductive biases. These missing biases include spatial locality and translation equivariance, both of which convolutions naturally provide. To this end, we introduce Convolution Injector (CoIn), an add-on module that injects convolutions, which are rich in locality and equivariance biases, into a pretrained ViT for effective adaptation to visuo-motor control. We evaluate CoIn with three distinct types of pretrained ViTs (CLIP, MVP, VC-1) across 12 varied control tasks in three separate domains (Adroit, MetaWorld, DMC), and demonstrate that CoIn consistently improves control task performance across all tested environments and models, validating the effectiveness of providing pretrained ViTs with control-centric biases.
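As a small illustration of the two biases named above, the following sketch checks them numerically for a plain 2D convolution. It is not the CoIn module and makes no claim about its architecture. The helper names `circular_conv2d` and `shift`, the toy image, and the 3x3 kernel are all assumptions introduced for this example, and circular (wrap-around) padding is assumed so that equivariance holds exactly at the borders.

```python
def circular_conv2d(image, kernel):
    """Hypothetical helper: 2D cross-correlation with wrap-around padding."""
    h, w = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for i in range(kh):
                for j in range(kw):
                    # Each output pixel reads only a kh x kw neighborhood.
                    acc += kernel[i][j] * image[(y + i - kh // 2) % h][(x + j - kw // 2) % w]
            out[y][x] = acc
    return out


def shift(image, dy, dx):
    """Hypothetical helper: circularly translate an image by (dy, dx)."""
    h, w = len(image), len(image[0])
    return [[image[(y - dy) % h][(x - dx) % w] for x in range(w)] for y in range(h)]


# Toy integer inputs (assumed for illustration) keep the arithmetic exact.
image = [[(3 * y + 5 * x) % 7 for x in range(6)] for y in range(6)]
kernel = [[1, 2, 1], [2, 4, 2], [1, 2, 1]]

# Translation equivariance: shifting then convolving equals convolving then shifting.
shift_then_conv = circular_conv2d(shift(image, 2, 3), kernel)
conv_then_shift = shift(circular_conv2d(image, kernel), 2, 3)
equivariant = shift_then_conv == conv_then_shift

# Spatial locality: changing one input pixel changes only the outputs
# whose 3x3 window covers that pixel.
perturbed = [row[:] for row in image]
perturbed[3][3] += 1
before = circular_conv2d(image, kernel)
after = circular_conv2d(perturbed, kernel)
affected = sum(
    1 for y in range(6) for x in range(6) if before[y][x] != after[y][x]
)

print(equivariant, affected)
```

Self-attention over patch tokens has neither property built in: every token can attend to every other token, and learned position embeddings tie features to absolute locations.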