Group Multi-View Transformer for 3D Shape Analysis with Spatial Encoding

In recent years, the results of view-based 3D shape recognition methods have saturated, and the best-performing models cannot be deployed on memory-limited devices because of their large parameter counts. To address this problem, we introduce a compression method based on knowledge distillation for this field, which substantially reduces the number of parameters while preserving as much model performance as possible. Specifically, to strengthen the smaller models, we first design a high-performing large model called Group Multi-view Vision Transformer (GMViT). In GMViT, a view-level ViT first establishes relationships among view-level features. To capture deeper features, a grouping module then enhances the view-level features into group-level features. Finally, a group-level ViT aggregates the group-level features into a complete, well-formed 3D shape descriptor. Notably, in both ViTs we introduce a spatial encoding of camera coordinates as a novel position embedding. Furthermore, we propose two compressed versions of GMViT, namely GMViT-simple and GMViT-mini. To train the small models more effectively, we apply knowledge distillation throughout the GMViT pipeline, using the key outputs of each GMViT component as distillation targets. Extensive experiments demonstrate the efficacy of the proposed method. The large model GMViT achieves excellent 3D classification and retrieval results on the benchmark datasets ModelNet, ShapeNetCore55, and MCB. The smaller models, GMViT-simple and GMViT-mini, reduce the parameter count by factors of 8 and 17.6, respectively, and speed up shape recognition by 1.5 times on average, while preserving at least 90% of the classification and retrieval performance.
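
To make two ideas from the abstract concrete, the following is a minimal sketch rather than the GMViT implementation. It assumes that the spatial encoding is a learned linear map from a camera's Cartesian coordinates to an embedding vector, and that distillation combines a mean-squared error on intermediate outputs with a temperature-softened KL term on the logits. The function names, the dictionary keys, the 12-view camera rig, and the loss forms are all hypothetical choices made for illustration; the abstract does not specify them.

```python
import math

def camera_position(azimuth_deg, elevation_deg, radius=1.0):
    """Cartesian (x, y, z) of a camera on a sphere around the object."""
    az, el = math.radians(azimuth_deg), math.radians(elevation_deg)
    return (radius * math.cos(el) * math.cos(az),
            radius * math.cos(el) * math.sin(az),
            radius * math.sin(el))

def spatial_encoding(xyz, weight, bias):
    """Assumed form: linear map from 3D camera coordinates to a d-dim embedding.
    weight has d rows of 3 values and bias has d values (learned in a real model)."""
    return [sum(w * c for w, c in zip(row, xyz)) + b
            for row, b in zip(weight, bias)]

def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits / temperature."""
    m = max(logits)
    exps = [math.exp((z - m) / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mse(a, b):
    """Mean-squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def distillation_loss(student, teacher, temperature=4.0):
    """Assumed objective: feature MSE at each stage plus softened-logit KL.
    Both arguments are dicts with 'view', 'group', 'descriptor' and 'logits';
    features are assumed to be already projected to matching dimensions."""
    feature_loss = sum(mse(student[k], teacher[k])
                       for k in ("view", "group", "descriptor"))
    p_teacher = softmax(teacher["logits"], temperature)
    p_student = softmax(student["logits"], temperature)
    return feature_loss + temperature ** 2 * kl_divergence(p_teacher, p_student)

# Hypothetical rig: 12 cameras at 30-degree azimuth steps, 30 degrees elevation.
cameras = [camera_position(30 * i, 30) for i in range(12)]
```

In a full model, each position embedding would be added to the corresponding view token before the view-level ViT, and the student would be trained on this loss together with the usual classification loss. Both of those steps are likewise assumptions about how such a pipeline is commonly assembled.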
