VTPNet for 3D deep learning on point cloud

Recently, Transformer-based methods have achieved good results on various point cloud learning benchmarks. However, because the attention mechanism must generate query, key, and value feature vectors to compute attention features, most existing Transformer-based point cloud learning methods consume a large amount of computation time and memory when calculating global attention. To address this problem, we propose a Voxel-Transformer-Point (VTP) Block for extracting local and global features of point clouds. VTP combines the advantages of voxel-based, point-based, and Transformer-based methods and consists of a Voxel-Based Branch (V branch), a Point-Based Transformer Branch (PT branch), and a Point-Based Branch (P branch). The V branch extracts coarse-grained features of the point cloud at a low voxel resolution; the PT branch obtains fine-grained features by computing self-attention within each local neighborhood and cross-attention between neighborhoods; the P branch uses a simplified MLP network to generate global location information for the point cloud. In addition, to enrich the local features of point clouds at different scales, we set the voxel scale in the V branch and the neighborhood sphere scale in the PT branch so that one is large and the other is small (a large voxel scale with a small neighborhood sphere scale, or a small voxel scale with a large neighborhood sphere scale). Finally, we use VTP as the feature extraction network to construct VTPNet for point cloud learning, and perform shape classification, part segmentation, and semantic segmentation on the ModelNet40, ShapeNet Part, and S3DIS datasets, respectively. The experimental results indicate that VTPNet performs well in 3D point cloud learning.
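To make two of the ideas above concrete, the following is a minimal NumPy sketch and not the authors' implementation. It assumes that coarse voxel features can be approximated by averaging per-point features inside each voxel and copying the average back to the points, and it compares the number of attention scores needed for global attention with the number needed when each point attends only to a fixed-size neighborhood. The function names `voxel_branch_features` and `attention_score_entries`, and the parameters `resolution` and `neighborhood_size`, are hypothetical and introduced only for illustration.

```python
import numpy as np


def voxel_branch_features(points, feats, resolution):
    """Illustrative coarse feature extraction (an assumption, not the paper's code).

    points: (N, 3) array of xyz coordinates.
    feats:  (N, C) array of per-point features.
    Returns an (N, C) array in which every point carries the mean feature
    of the voxel it falls into, plus the number of occupied voxels.
    """
    # Normalize coordinates into the unit cube using the largest axis extent.
    mins = points.min(axis=0)
    extent = (points.max(axis=0) - mins).max()
    norm = (points - mins) / max(extent, 1e-12)

    # Integer voxel coordinates in [0, resolution - 1].
    idx = np.clip((norm * resolution).astype(np.int64), 0, resolution - 1)

    # Flatten the 3D voxel coordinate into a single voxel id.
    flat = (idx[:, 0] * resolution + idx[:, 1]) * resolution + idx[:, 2]

    # Map each point to the index of its occupied voxel.
    uniq, inverse = np.unique(flat, return_inverse=True)
    inverse = inverse.reshape(-1)

    # Sum features per occupied voxel, then divide by the point count.
    sums = np.zeros((uniq.size, feats.shape[1]), dtype=np.float64)
    np.add.at(sums, inverse, feats)
    counts = np.bincount(inverse, minlength=uniq.size)[:, None]
    voxel_mean = sums / counts

    # Copy each voxel's mean feature back to the points inside it.
    return voxel_mean[inverse], uniq.size


def attention_score_entries(num_points, neighborhood_size=None):
    """Count pairwise attention scores for one head.

    Global attention scores every point against every other point;
    local attention scores each point against a fixed-size neighborhood.
    """
    if neighborhood_size is None:
        return num_points * num_points
    return num_points * neighborhood_size


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pts = rng.random((1024, 3))
    f = rng.random((1024, 16))
    coarse, occupied = voxel_branch_features(pts, f, resolution=4)
    print(coarse.shape, occupied)
    print(attention_score_entries(1024), attention_score_entries(1024, 32))
```

In this sketch a smaller `resolution` gives fewer, larger voxels and therefore coarser features, while restricting attention to a neighborhood makes the number of scores grow linearly rather than quadratically with the number of points.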
