Monocular 3D detection (M3D) aims to localize objects precisely in 3D from a single-view image, but training such detectors usually requires labor-intensive annotation of 3D bounding boxes. Weakly supervised M3D has recently been studied as a way to avoid 3D annotation by leveraging the many existing 2D annotations, but it often requires extra training data such as LiDAR point clouds or multi-view images, which greatly limits its applicability and usability. We propose SKD-WM3D, a weakly supervised monocular 3D detection framework that exploits depth information to achieve M3D using only single-view images, without any 3D annotations or additional training data. One key design in SKD-WM3D is a self-knowledge distillation framework, which transforms image features into 3D-like representations by fusing depth information and effectively mitigates the depth ambiguity inherent in monocular scenarios, with little computational overhead at inference. In addition, we design an uncertainty-aware distillation loss and a gradient-targeted transfer modulation strategy, which facilitate knowledge acquisition and knowledge transfer, respectively. Extensive experiments show that SKD-WM3D clearly surpasses the state of the art and is even on par with many fully supervised methods.
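To make the idea of an uncertainty-aware distillation loss concrete, the following is a minimal sketch of one common way to weight a feature-matching term by a predicted uncertainty. It is an illustration under stated assumptions, not the formulation used in SKD-WM3D: the function name, the L1 feature distance, and the `log_sigma` parameterization are all hypothetical choices, and the abstract does not specify the actual loss.

```python
import math


def uncertainty_weighted_distill_loss(student, teacher, log_sigma):
    """Illustrative uncertainty-weighted distillation loss (hypothetical).

    Each element contributes |s - t| * exp(-log_sigma) + log_sigma:
    a large predicted uncertainty down-weights the feature mismatch,
    while the additive log_sigma term penalizes claiming high
    uncertainty everywhere. This is an assumed, generic form and not
    the loss defined by SKD-WM3D.
    """
    if not (len(student) == len(teacher) == len(log_sigma)):
        raise ValueError("inputs must have the same length")
    if len(student) == 0:
        raise ValueError("inputs must be non-empty")

    total = 0.0
    for s, t, ls in zip(student, teacher, log_sigma):
        # In a real training loop the teacher features would usually be
        # treated as constants (no gradient); that is an assumption here.
        total += abs(s - t) * math.exp(-ls) + ls
    return total / len(student)


# Toy features: the second element disagrees strongly.
student_feat = [1.0, 2.0]
teacher_feat = [1.0, 4.0]

confident = uncertainty_weighted_distill_loss(
    student_feat, teacher_feat, [0.0, 0.0]
)
uncertain = uncertainty_weighted_distill_loss(
    student_feat, teacher_feat, [0.0, math.log(2.0)]
)
```

In this toy case, raising the uncertainty on the mismatched element lowers the overall loss, which is the behavior such a weighting is meant to provide: unreliable targets contribute less to the transfer.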