In recent years, the performance of view-based 3D shape recognition methods has begun to saturate, and high-performing models cannot be deployed on memory-limited devices due to their large number of parameters. To address this problem, we introduce a compression method based on knowledge distillation for this field, which greatly reduces the number of parameters while preserving model performance as much as possible. Specifically, to strengthen the smaller models, we first design a high-performing large model called the Group Multi-view Vision Transformer (GMViT). In GMViT, a view-level ViT first establishes relationships among view-level features. Then, to capture deeper features, a grouping module aggregates the view-level features into group-level features. Finally, a group-level ViT combines the group-level features into a complete, well-formed 3D shape descriptor. Notably, in both ViTs we introduce a spatial encoding of the camera coordinates as a novel position embedding. Building on GMViT, we propose two compressed variants, GMViT-simple and GMViT-mini. To improve the training of these small models, we apply knowledge distillation throughout the GMViT pipeline, using the key output of each GMViT component as a distillation target. Extensive experiments demonstrate the efficacy of the proposed method. The large model GMViT achieves excellent 3D classification and retrieval results on the benchmark datasets ModelNet, ShapeNetCore55, and MCB. The smaller models, GMViT-simple and GMViT-mini, reduce the number of parameters by factors of 8 and 17.6, respectively, and increase shape recognition speed by a factor of 1.5 on average, while preserving at least 90% of the classification and retrieval performance. The code is available at https://github.com/bigdata-graph/GMViT.
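The pipeline described above (view-level ViT with camera-coordinate position embeddings, a grouping module, a group-level ViT, and component-wise distillation) can be sketched as follows. This is a minimal illustration, not the authors' implementation: all layer sizes, the mean-pooling grouping rule, and the loss weights are assumptions, and the distillation loss assumes student and teacher share feature dimensions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GMViTSketch(nn.Module):
    """Illustrative sketch of the GMViT pipeline (hypothetical sizes)."""

    def __init__(self, feat_dim=64, num_groups=4, num_classes=10):
        super().__init__()
        # Spatial encoding of 3D camera coordinates used as position embeddings.
        self.pos_embed = nn.Linear(3, feat_dim)
        make_layer = lambda: nn.TransformerEncoderLayer(
            feat_dim, nhead=4, batch_first=True)
        self.view_vit = nn.TransformerEncoder(make_layer(), num_layers=2)
        self.group_vit = nn.TransformerEncoder(make_layer(), num_layers=2)
        self.num_groups = num_groups
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, view_feats, cam_coords):
        # view_feats: (B, V, D) per-view features; cam_coords: (B, V, 3).
        B, V, D = view_feats.shape
        x = self.view_vit(view_feats + self.pos_embed(cam_coords))
        # Grouping module (assumed): mean-pool contiguous chunks of views
        # into group-level features; V must be divisible by num_groups here.
        groups = x.view(B, self.num_groups, -1, D).mean(dim=2)
        gcoords = cam_coords.view(B, self.num_groups, -1, 3).mean(dim=2)
        g = self.group_vit(groups + self.pos_embed(gcoords))
        desc = g.mean(dim=1)  # aggregate into the 3D shape descriptor
        return self.head(desc), x, g, desc

def distill_loss(student_out, teacher_out, labels, T=4.0, alpha=0.5):
    """KD over each component's key output plus standard logit distillation."""
    s_logits, s_view, s_group, s_desc = student_out
    t_logits, t_view, t_group, t_desc = teacher_out
    kd = F.kl_div(F.log_softmax(s_logits / T, dim=-1),
                  F.softmax(t_logits / T, dim=-1),
                  reduction="batchmean") * T * T
    feat = sum(F.mse_loss(s, t) for s, t in
               [(s_view, t_view), (s_group, t_group), (s_desc, t_desc)])
    return F.cross_entropy(s_logits, labels) + alpha * kd + feat
```

In practice the student would be a shallower network, with projection layers to match the teacher's feature dimensions before the MSE terms are computed.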