In recent years, the performance of view-based 3D shape recognition methods has begun to saturate, and high-performing models cannot be deployed on memory-limited devices due to their large number of parameters. To address this problem, we introduce a compression method based on knowledge distillation for this field, which greatly reduces the number of parameters while preserving model performance as much as possible. Specifically, to strengthen the smaller models, we first design a high-performing large model called the Group Multi-view Vision Transformer (GMViT). In GMViT, a view-level ViT first establishes relationships among view-level features. Then, to capture deeper features, a grouping module aggregates the view-level features into group-level features. Finally, a group-level ViT combines the group-level features into a complete, well-formed 3D shape descriptor. Notably, in both ViTs we introduce a spatial encoding of the camera coordinates as a novel position embedding. Building on GMViT, we propose two compressed variants, GMViT-simple and GMViT-mini. To improve the training of these small models, we apply knowledge distillation throughout the GMViT pipeline, using the key output of each GMViT component as a distillation target. Extensive experiments demonstrate the efficacy of the proposed method. The large model GMViT achieves excellent 3D classification and retrieval results on the benchmark datasets ModelNet, ShapeNetCore55, and MCB. The smaller models, GMViT-simple and GMViT-mini, reduce the number of parameters by factors of 8 and 17.6, respectively, and increase shape recognition speed by a factor of 1.5 on average, while preserving at least 90% of the classification and retrieval performance. The code is available at https://github.com/bigdata-graph/GMViT.
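The pipeline described above (view-level ViT with camera-coordinate position embeddings, a grouping module, a group-level ViT, and component-wise distillation) can be sketched as follows. This is a minimal illustration, not the authors' implementation: all layer sizes, the mean-pooling grouping rule, and the loss weights are assumptions, and the distillation loss assumes student and teacher share feature dimensions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GMViTSketch(nn.Module):
    """Illustrative sketch of the GMViT pipeline (hypothetical sizes)."""

    def __init__(self, feat_dim=64, num_groups=4, num_classes=10):
        super().__init__()
        # Spatial encoding of 3D camera coordinates used as position embeddings.
        self.pos_embed = nn.Linear(3, feat_dim)
        make_layer = lambda: nn.TransformerEncoderLayer(
            feat_dim, nhead=4, batch_first=True)
        self.view_vit = nn.TransformerEncoder(make_layer(), num_layers=2)
        self.group_vit = nn.TransformerEncoder(make_layer(), num_layers=2)
        self.num_groups = num_groups
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, view_feats, cam_coords):
        # view_feats: (B, V, D) per-view features; cam_coords: (B, V, 3).
        B, V, D = view_feats.shape
        x = self.view_vit(view_feats + self.pos_embed(cam_coords))
        # Grouping module (assumed): mean-pool contiguous chunks of views
        # into group-level features; V must be divisible by num_groups here.
        groups = x.view(B, self.num_groups, -1, D).mean(dim=2)
        gcoords = cam_coords.view(B, self.num_groups, -1, 3).mean(dim=2)
        g = self.group_vit(groups + self.pos_embed(gcoords))
        desc = g.mean(dim=1)  # aggregate into the 3D shape descriptor
        return self.head(desc), x, g, desc

def distill_loss(student_out, teacher_out, labels, T=4.0, alpha=0.5):
    """KD over each component's key output plus standard logit distillation."""
    s_logits, s_view, s_group, s_desc = student_out
    t_logits, t_view, t_group, t_desc = teacher_out
    kd = F.kl_div(F.log_softmax(s_logits / T, dim=-1),
                  F.softmax(t_logits / T, dim=-1),
                  reduction="batchmean") * T * T
    feat = sum(F.mse_loss(s, t) for s, t in
               [(s_view, t_view), (s_group, t_group), (s_desc, t_desc)])
    return F.cross_entropy(s_logits, labels) + alpha * kd + feat
```

In practice the student would be a shallower network, with projection layers to match the teacher's feature dimensions before the MSE terms are computed.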