Vision-based robot policy learning, which maps visual inputs to actions, necessitates a holistic understanding of diverse visual tasks beyond single-task needs like classification or segmentation. Inspired by this, we introduce Theia, a vision foundation model for robot learning that distills multiple off-the-shelf vision foundation models trained on varied vision tasks. Theia's rich visual representations encode diverse visual knowledge, enhancing downstream robot learning. Extensive experiments demonstrate that Theia outperforms its teacher models and prior robot learning models using less training data and smaller model sizes. Additionally, we quantify the quality of pre-trained visual representations and hypothesize that higher entropy in feature norm distributions leads to improved robot learning performance. Code and models are available at https://github.com/bdaiinstitute/theia.
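The proposed quality metric is the entropy of the distribution of feature norms. The abstract does not specify how this distribution is discretized, so the sketch below makes assumptions: features are taken as a `(num_tokens, dim)` array of spatial tokens, norms are binned into a fixed-size histogram, and Shannon entropy is computed over the normalized bin counts. The function name and bin count are illustrative, not from the paper.

```python
import numpy as np

def feature_norm_entropy(features: np.ndarray, num_bins: int = 100) -> float:
    """Shannon entropy of the feature-norm distribution (hedged sketch).

    features: (num_tokens, dim) array of per-token representation vectors.
    The binning scheme and num_bins are assumptions, not the paper's exact recipe.
    """
    # L2 norm of each spatial token's feature vector
    norms = np.linalg.norm(features, axis=-1)
    # Discretize the norms into a histogram and normalize to a probability mass
    hist, _ = np.histogram(norms, bins=num_bins)
    p = hist / hist.sum()
    p = p[p > 0]  # drop empty bins to avoid log(0)
    return float(-(p * np.log(p)).sum())
```

Under the paper's hypothesis, a representation whose token norms spread over many bins (higher entropy) would be predicted to support better downstream robot learning than one whose norms collapse to a few values.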