Mid-level vision capabilities, such as generic object localization and 3D geometric understanding, are fundamental to human vision and crucial for many real-world applications of computer vision. These abilities emerge with minimal supervision during the early stages of human visual development. Despite their significance, current self-supervised learning (SSL) approaches are primarily designed and evaluated for high-level recognition tasks, leaving their mid-level vision capabilities largely unexamined. In this study, we introduce a suite of benchmark protocols to systematically assess mid-level vision capabilities and present a comprehensive, controlled evaluation of 22 prominent SSL models across 8 mid-level vision tasks. Our experiments reveal that performance on mid-level tasks is only weakly correlated with performance on high-level tasks. We also identify several SSL methods whose performance is highly imbalanced between mid-level and high-level capabilities, as well as some that excel at both. Additionally, we investigate key factors that contribute to mid-level vision performance, such as pretraining objectives and network architectures. Our study provides a holistic and timely view of what SSL models have learned, complementing existing research that focuses primarily on high-level vision tasks. We hope our findings guide future SSL research to benchmark models on mid-level as well as high-level vision tasks.