Recently, 3D shape understanding has made significant progress thanks to advances in deep learning models for various data formats such as images, voxels, and point clouds. Among these formats, point clouds and multi-view images are two complementary modalities of 3D objects, and learning representations by fusing them has proven effective. While prior works typically focus on exploiting global features of the two modalities, herein we argue that more discriminative features can be derived by modeling "where to fuse". To investigate this, we propose a novel Locality-Aware Point-View Fusion Transformer (LATFormer) for 3D shape retrieval and classification. The core component of LATFormer is a module named Locality-Aware Fusion (LAF), which integrates the local features of correlated regions across the two modalities based on their co-occurrence scores. We further propose filtering out low-valued scores to obtain salient local co-occurring regions, which reduces redundancy in the fusion process. In LATFormer, we use the LAF module to fuse the multi-scale features of the two modalities both bidirectionally and hierarchically to obtain more informative features. Comprehensive experiments on four popular 3D shape benchmarks covering 3D object retrieval and classification validate its effectiveness.
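To make the idea of score-filtered local fusion concrete, the following is a minimal NumPy sketch of one fusion direction. It is not the LAF module itself: the abstract does not specify how co-occurrence scores are computed or how features are combined, so the scaled dot-product score, the top-k filtering rule, the softmax weighting, and the residual addition are all assumptions, and the function name `locality_aware_fuse` and the parameter `keep` are hypothetical.

```python
import numpy as np


def locality_aware_fuse(target_feats, source_feats, keep=4):
    """Fuse source-region features into target-region features (one direction).

    target_feats: (T, D) local features of T regions in one modality.
    source_feats: (S, D) local features of S regions in the other modality.
    keep:         number of highest-scoring source regions kept per target region.
    """
    d = target_feats.shape[1]
    # Assumed co-occurrence score: scaled dot-product similarity, shape (T, S).
    scores = target_feats @ source_feats.T / np.sqrt(d)

    # Filter out low scores: keep only the top-`keep` entries in each row.
    keep = min(keep, scores.shape[1])
    kth_largest = np.partition(scores, -keep, axis=1)[:, -keep][:, None]
    masked = np.where(scores >= kth_largest, scores, -np.inf)

    # Softmax over the surviving scores; filtered entries receive weight 0.
    masked = masked - masked.max(axis=1, keepdims=True)
    weights = np.exp(masked)
    weights /= weights.sum(axis=1, keepdims=True)

    # Assumed residual fusion: add the weighted source features to the targets.
    return target_feats + weights @ source_feats, weights


# Toy data: 8 point-cloud regions and 12 image regions with 16-dim features.
rng = np.random.default_rng(0)
point_feats = rng.normal(size=(8, 16))
view_feats = rng.normal(size=(12, 16))

# Bidirectional use: views -> points, then points -> views.
fused_points, w_pv = locality_aware_fuse(point_feats, view_feats, keep=4)
fused_views, w_vp = locality_aware_fuse(view_feats, point_feats, keep=4)
```

Calling the function once per direction mirrors the bidirectional fusion described in the abstract; a hierarchical variant would repeat it on features from each scale. Each row of the returned weights sums to one, and at most `keep` entries per row are non-zero (barring ties), which is where the redundancy reduction comes from in this sketch.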