We present ModMap, a natively multiview and multimodal framework for 3D anomaly detection and segmentation. Unlike existing methods, which process views independently, ModMap draws on the crossmodal feature mapping paradigm: it learns to map features across both modalities and views while explicitly modelling view-dependent relationships through feature-wise modulation. We introduce a cross-view training strategy that leverages all possible view combinations, enabling effective anomaly scoring through multiview ensembling and aggregation. To process high-resolution 3D data, we train and publicly release a foundational depth encoder tailored to industrial datasets. Experiments on SiM3D, a recent benchmark that introduces the first multiview and multimodal setup for 3D anomaly detection and segmentation, demonstrate that ModMap attains state-of-the-art performance, surpassing previous methods by wide margins.
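The cross-view idea summarized above can be illustrated with a toy sketch. This is not the ModMap implementation: the abstract does not specify the mapping network, whether view pairs are ordered, the discrepancy measure, or the aggregation rule. The sketch assumes ordered (source, target) pairs, uses a bare scale-and-shift modulation as a stand-in for the learned mapping, scores anomalies by L2 discrepancy, and averages over source views. All function names and parameters (`film`, `view_pairs`, `anomaly_score`, `film_params`) are hypothetical.

```python
from itertools import permutations
from math import sqrt


def film(features, gamma, beta):
    """Feature-wise modulation: scale and shift each channel.

    Stand-in for a learned, view-conditioned mapping (assumption).
    """
    return [g * x + b for x, g, b in zip(features, gamma, beta)]


def view_pairs(view_ids):
    """All ordered (source, target) pairs of distinct views (assumed ordering)."""
    return list(permutations(view_ids, 2))


def anomaly_score(features_by_view, film_params, target):
    """Average discrepancy between the target view's observed features and
    the features predicted for it from every other view.

    features_by_view: dict mapping view id -> list of floats
    film_params: dict mapping (source, target) -> (gamma, beta)
    """
    errors = []
    for src, tgt in view_pairs(sorted(features_by_view)):
        if tgt != target:
            continue
        gamma, beta = film_params[(src, tgt)]
        predicted = film(features_by_view[src], gamma, beta)
        observed = features_by_view[tgt]
        # L2 distance between predicted and observed features (assumed metric)
        errors.append(sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed))))
    # Mean over source views as a simple ensembling rule (assumed)
    return sum(errors) / len(errors)
```

In this toy setting, a target view whose features are well predicted from every other view receives a low score, while a view that disagrees with the predictions made from its neighbours receives a high one.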