Visual representation learning holds great promise for robotics, but is severely hampered by the scarcity and homogeneity of robotics datasets. Recent works address this problem by pre-training visual representations on large-scale but out-of-domain data (e.g., videos of egocentric interactions) and then transferring them to target robotics tasks. While the field is heavily focused on developing better pre-training algorithms, we find that dataset choice is just as important to this paradigm's success. After all, the representation can only learn the structures or priors present in the pre-training dataset. To this end, we shift the focus away from algorithms and instead conduct a dataset-centric analysis of robotic pre-training. Our findings call into question some common wisdom in the field. We observe that traditional vision datasets (like ImageNet, Kinetics, and 100 Days of Hands) are surprisingly competitive options for visuo-motor representation learning, and that the pre-training dataset's image distribution matters more than its size. Finally, we show that common simulation benchmarks are not a reliable proxy for real-world performance and that simple regularization strategies can dramatically improve real-world policy learning. https://data4robotics.github.io