This paper proposes a cross-modal distillation framework, PartDistill, which transfers 2D knowledge from vision-language models (VLMs) to facilitate 3D shape part segmentation. PartDistill addresses three major challenges in this task: the lack of 3D segmentation in invisible or undetected regions of the 2D projections, inconsistent 2D predictions by VLMs, and the lack of knowledge accumulation across different 3D shapes. PartDistill consists of a teacher network that uses a VLM to make 2D predictions and a student network that learns from those 2D predictions while extracting geometrical features from multiple 3D shapes to carry out 3D part segmentation. A bi-directional distillation, comprising forward and backward distillation, is carried out within the framework: the former distills the 2D predictions to the student network, and the latter improves the quality of the 2D predictions, which subsequently enhances the final 3D segmentation. Moreover, PartDistill can exploit generative models that enable effortless 3D shape creation to generate additional knowledge sources for distillation. Through extensive experiments, PartDistill outperforms existing methods by substantial margins on the widely used ShapeNetPart and PartNetE datasets, achieving mIoU scores more than 15% and 12% higher, respectively. The code for this work is available at https://github.com/ardianumam/PartDistill.
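The bi-directional distillation described above can be illustrated with a minimal sketch. The abstract does not specify the loss formulation, so the functions below are hypothetical: forward distillation is sketched as a confidence-weighted cross-entropy from the teacher's 2D predictions to the student, and backward distillation as re-weighting each teacher prediction by its agreement with the student, down-weighting inconsistent 2D predictions.

```python
import numpy as np

def forward_distill(student_logits, teacher_probs, weights):
    """Hypothetical forward distillation: weighted cross-entropy loss.

    student_logits: (N, C) per-point class logits from the 3D student.
    teacher_probs:  (N, C) per-point class probabilities from the 2D VLM teacher.
    weights:        (N,) per-prediction confidence weights.
    """
    # Numerically stable log-softmax over the class dimension
    shifted = student_logits - student_logits.max(-1, keepdims=True)
    log_p = shifted - np.log(np.exp(shifted).sum(-1, keepdims=True))
    # Confidence-weighted cross-entropy against the teacher distribution
    return -(weights[:, None] * teacher_probs * log_p).sum() / weights.sum()

def backward_distill(teacher_probs, student_probs):
    """Hypothetical backward distillation: re-score teacher predictions.

    Agreement between teacher and student distributions is used to
    down-weight inconsistent 2D predictions before the next forward pass.
    """
    agreement = (teacher_probs * student_probs).sum(-1)  # (N,)
    return agreement / agreement.max()  # normalize so the best prediction has weight 1
```

This is only a sketch under stated assumptions; the actual PartDistill objectives and re-scoring rule are defined in the paper and released code.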