MSFormer: A Skeleton-multiview Fusion Method For Tooth Instance Segmentation

Deep learning-based tooth segmentation methods are limited by the expensive and time-consuming processes of data collection and labeling, so achieving high-precision segmentation with limited datasets is critical. A viable solution is to fine-tune pre-trained multiview-based models, which improves performance when training data are scarce. However, relying solely on two-dimensional (2D) images for three-dimensional (3D) tooth segmentation can produce suboptimal results, because occlusion and deformation leave shape perception incomplete and distorted. To improve this fine-tuning-based solution, this paper advocates 2D-3D joint perception. The fundamental challenge in employing 2D-3D joint perception with limited data is that the 3D-related inputs and modules must follow a lightweight policy, rather than relying on large volumes of 3D data and parameter-rich modules that require extensive training data. Following this lightweight policy, this paper selects skeletons as the 3D inputs and introduces MSFormer, a novel method for tooth segmentation. MSFormer incorporates two lightweight modules into existing multiview-based models: a 3D-skeleton perception module that extracts 3D perception from skeletons, and a skeleton-image contrastive learning module that obtains 2D-3D joint perception by fusing the multiview and skeleton perceptions. The experimental results show that MSFormer paired with large pre-trained multiview models achieves state-of-the-art performance with only 100 training meshes. Furthermore, segmentation accuracy improves by 2.4%-5.5% as the volume of training data increases.
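To make the idea of skeleton-image contrastive learning concrete, the following is a minimal sketch of a generic symmetric InfoNCE-style loss that pulls together the skeleton embedding and the multiview-image embedding of the same tooth and pushes apart embeddings of different teeth. It is an illustration under stated assumptions, not the loss used by MSFormer: the abstract does not specify the formulation, and the function names, the `temperature` value, and the pairing of one skeleton embedding with one image embedding per tooth are all hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def l2_normalize(v: Vector) -> List[float]:
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def _cross_entropy_diagonal(logits: List[List[float]]) -> float:
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def skeleton_image_contrastive_loss(
    skeleton_emb: Sequence[Vector],
    image_emb: Sequence[Vector],
    temperature: float = 0.07,  # assumed value, not taken from the paper
) -> float:
    """Symmetric InfoNCE loss (hypothetical stand-in for the paper's module).

    skeleton_emb[i] and image_emb[i] are assumed to describe the same tooth;
    every other pairing in the batch is treated as a negative.
    """
    if not skeleton_emb or len(skeleton_emb) != len(image_emb):
        raise ValueError("expected two non-empty batches of equal size")

    s = [l2_normalize(v) for v in skeleton_emb]
    m = [l2_normalize(v) for v in image_emb]

    # logits[i][j] = cosine similarity of skeleton i and image j, over temperature
    logits = [
        [sum(a * b for a, b in zip(si, mj)) / temperature for mj in m]
        for si in s
    ]
    logits_t = [list(col) for col in zip(*logits)]  # image-to-skeleton direction

    return 0.5 * (_cross_entropy_diagonal(logits) + _cross_entropy_diagonal(logits_t))
```

In this sketch the loss is close to zero when each skeleton embedding is most similar to its own image embedding and grows when the pairs are mismatched, which is the behavior a fusion module trained this way would exploit.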
