Self-supervised image networks can address complex 2D tasks (e.g., semantic segmentation, object discovery) very efficiently and with little or no downstream supervision. However, self-supervised 3D networks on lidar data do not yet perform as well. A few methods therefore propose to distill high-quality self-supervised 2D features into 3D networks. The most recent of these, applied to autonomous driving data, show promising results. Yet a performance gap persists between these distilled features and fully supervised ones. In this work, we revisit 2D-to-3D distillation. First, we propose, for semantic segmentation, a simple approach that leads to a significant improvement over prior 3D distillation methods. Second, we show that distilling into high-capacity 3D networks is key to obtaining high-quality 3D features. This allows us to significantly narrow the gap between unsupervised distilled 3D features and fully supervised ones. Last, we show that our high-quality distilled representations can also be used for open-vocabulary segmentation and background/foreground discovery.