Mainstream 3D representation learning approaches are built upon contrastive or generative modeling pretext tasks and have achieved substantial performance improvements on various downstream tasks. However, we find that these two paradigms have different characteristics: (i) contrastive models are data-hungry and suffer from a representation over-fitting issue; (ii) generative models have a data filling issue and show inferior data scaling capacity compared to contrastive models. This motivates us to learn 3D representations that share the merits of both paradigms, which is non-trivial due to the pattern difference between them. In this paper, we propose Contrast with Reconstruct (ReCon), which unifies these two paradigms. ReCon is trained to learn from both generative modeling teachers and single/cross-modal contrastive teachers through ensemble distillation, where the generative student guides the contrastive student. We propose an encoder-decoder style ReCon-block that transfers knowledge through cross-attention with stop-gradient, which avoids the pretraining over-fitting and pattern difference issues. ReCon achieves a new state of the art in 3D representation learning, e.g., 91.26% accuracy on ScanObjectNN. Code has been released at https://github.com/qizekun/ReCon.
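The cross-attention with stop-gradient described in the abstract can be illustrated with a minimal PyTorch sketch. It is not the released ReCon implementation: the class name `CrossAttentionWithStopGrad`, the learnable `queries`, the layer sizes, the linear stand-in encoder, and the placeholder loss are all hypothetical, and the sketch assumes the stop-gradient is applied to the encoder features used as keys and values. It only shows that a decoder-side branch can read encoder features without back-propagating into the encoder.

```python
import torch
import torch.nn as nn


class CrossAttentionWithStopGrad(nn.Module):
    """Hypothetical decoder block: learnable queries attend to detached encoder tokens."""

    def __init__(self, dim: int = 32, num_heads: int = 4, num_queries: int = 2):
        super().__init__()
        # Hypothetical learnable global queries (e.g. one per contrastive teacher).
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim) * 0.02)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, encoder_tokens: torch.Tensor) -> torch.Tensor:
        # Stop-gradient (assumed placement): the decoder reads the encoder
        # features but cannot back-propagate into the encoder that made them.
        kv = self.norm_kv(encoder_tokens.detach())
        q = self.norm_q(self.queries).expand(encoder_tokens.shape[0], -1, -1)
        out, _ = self.attn(q, kv, kv, need_weights=False)
        return q + out  # residual connection on the query stream


torch.manual_seed(0)
encoder = nn.Linear(3, 32)        # toy stand-in for a point-patch encoder
block = CrossAttentionWithStopGrad()

points = torch.randn(8, 16, 3)    # (batch, patches, xyz), random toy data
tokens = encoder(points)          # (8, 16, 32)
decoded = block(tokens)           # (8, 2, 32)

# Placeholder objective standing in for a contrastive loss.
contrastive_loss = decoded.pow(2).mean()
contrastive_loss.backward()

# The decoder parameters receive gradients; the encoder does not.
encoder_untouched = encoder.weight.grad is None
queries_updated = block.queries.grad is not None
```

In this sketch the encoder would be trained only by whatever other loss is attached to `tokens` directly (for example a reconstruction objective), which mirrors the abstract's statement that the generative student guides the contrastive student.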