Polygonal meshes have become the standard for discretely approximating 3D shapes, thanks to their efficiency and flexibility in capturing non-uniform shapes. This non-uniformity, however, makes the mesh structure irregular, which makes tasks such as segmentation of 3D meshes particularly challenging. Semantic segmentation of 3D meshes has typically been addressed with CNN-based approaches, which achieve good accuracy. Recently, transformers have gained considerable momentum in both NLP and computer vision, achieving performance at least on par with CNN models and lending support to the long-sought goal of architectural universality. Following this trend, we propose a transformer-based method for semantic segmentation of 3D meshes, motivated by the better modeling of the graph structure of meshes that global attention mechanisms allow. Standard transformer architectures are limited both in modeling relative positions of non-sequential data, such as 3D meshes, and in capturing local context. To address these limitations, we replace the traditional sinusoidal positional encodings with the Laplacian eigenvectors of the adjacency matrix, and we introduce clustering-based features into the self-attention and cross-attention operators. Experimental results on three sets of the Shape COSEG Dataset, on the human segmentation dataset proposed by Maron et al. (2017), and on the ShapeNet benchmark show that the proposed approach yields state-of-the-art performance on semantic segmentation of 3D meshes.
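The Laplacian positional encoding mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the graph nodes are mesh vertices, uses the symmetric normalized Laplacian, and keeps the first `k` non-trivial eigenvectors. The function names `vertex_adjacency` and `laplacian_positional_encoding`, the choice of `k`, and the toy two-triangle mesh are all hypothetical and introduced only for illustration.

```python
import numpy as np


def vertex_adjacency(faces, num_vertices):
    """Build a symmetric 0/1 vertex adjacency matrix from triangle faces."""
    A = np.zeros((num_vertices, num_vertices), dtype=float)
    for a, b, c in faces:
        # Each triangle contributes its three edges.
        for i, j in ((a, b), (b, c), (a, c)):
            A[i, j] = 1.0
            A[j, i] = 1.0
    return A


def laplacian_positional_encoding(adjacency, k):
    """Return the k eigenvectors of the normalized graph Laplacian that
    follow the first (trivial) one, one row per node.

    Assumes a connected graph with no isolated nodes and k < number of nodes.
    """
    A = np.asarray(adjacency, dtype=float)
    n = A.shape[0]
    deg = A.sum(axis=1)
    inv_sqrt_deg = 1.0 / np.sqrt(deg)
    # Symmetric normalized Laplacian: L = I - D^{-1/2} A D^{-1/2}
    L = np.eye(n) - inv_sqrt_deg[:, None] * A * inv_sqrt_deg[None, :]
    # eigh returns eigenvalues in ascending order with orthonormal eigenvectors.
    _, eigvecs = np.linalg.eigh(L)
    # Skip column 0, which belongs to eigenvalue 0 on a connected graph.
    return eigvecs[:, 1:k + 1]


# Toy mesh (hypothetical): two triangles sharing the edge (1, 2).
faces = [(0, 1, 2), (1, 2, 3)]
A = vertex_adjacency(faces, num_vertices=4)
pe = laplacian_positional_encoding(A, k=2)
print(pe.shape)  # one k-dimensional encoding per vertex
```

Each eigenvector is defined only up to sign, so in practice such encodings are often used with random sign flipping during training; whether the proposed method does so is not stated here.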