Learning dense visual representations without labels is an arduous task, and even more so from scene-centric data. We tackle this problem with a Cross-view consistency objective coupled with an Online Clustering mechanism (CrOC) that discovers and segments the semantics of the views. Because it relies on no hand-crafted priors, the resulting method generalizes more readily and requires no cumbersome pre-processing step. More importantly, the clustering algorithm operates jointly on the features of both views, thereby bypassing both the issue of content not represented in both views and the ambiguous matching of objects from one crop to the other. We demonstrate excellent performance on linear and unsupervised segmentation transfer tasks on various datasets, and likewise on video object segmentation. Our code and pre-trained models are publicly available at https://github.com/stegmuel/CrOC.
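The idea of clustering both views jointly can be illustrated with a minimal sketch. It is not the CrOC implementation: plain k-means with a deterministic farthest-point initialization stands in for the online clustering mechanism, short Python lists stand in for patch embeddings, and the names `joint_kmeans`, `view_centroids`, and `cross_view_pairs` are hypothetical. The sketch only shows why a shared clustering sidesteps the matching problem: a cluster with members in both views yields a centroid pair for a consistency objective, while content seen in only one view is dropped automatically.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(vectors):
    """Element-wise mean of a non-empty list of feature vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def joint_kmeans(feats_a, feats_b, k, iters=10):
    """Cluster the patch features of both views in one shared space.

    Assumption: k-means is a stand-in for the paper's online clustering.
    Returns one cluster label per patch, split back into the two views.
    """
    points = feats_a + feats_b
    # Deterministic farthest-point initialization of the k centroids.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign every patch (from either view) to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        # Recompute each centroid; an empty cluster keeps its old centroid.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = mean(members)
    return labels[: len(feats_a)], labels[len(feats_a):]


def view_centroids(feats, labels):
    """Average the features of one view within each cluster it contains."""
    groups = {}
    for f, l in zip(feats, labels):
        groups.setdefault(l, []).append(f)
    return {l: mean(g) for l, g in groups.items()}


def cross_view_pairs(feats_a, feats_b, k):
    """Return (centroid_a, centroid_b) pairs for clusters present in both views."""
    labels_a, labels_b = joint_kmeans(feats_a, feats_b, k)
    cents_a = view_centroids(feats_a, labels_a)
    cents_b = view_centroids(feats_b, labels_b)
    shared = sorted(set(cents_a) & set(cents_b))
    return [(cents_a[c], cents_b[c]) for c in shared]
```

In a training loop, the returned centroid pairs would be the quantities a cross-view consistency loss compares. Clusters missing from either view contribute nothing, so no explicit object matching between crops is needed.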