Multi-View Hierarchical Graph Neural Network for Sketch-Based 3D Shape Retrieval

Sketch-based 3D shape retrieval (SBSR) aims to retrieve 3D shapes that match the category of an input hand-drawn sketch. The task poses two core challenges. First, existing methods typically aggregate independently encoded multi-view features of a 3D shape with simplified strategies that ignore the geometric relationships between views and multi-level details, resulting in weak 3D representations. Second, traditional SBSR methods are restricted to the categories seen during training, leading to poor performance in zero-shot scenarios. To address these challenges, we propose the Multi-View Hierarchical Graph Neural Network (MV-HGNN), a novel framework for SBSR. Specifically, we construct a view-level graph, capture geometric dependencies between adjacent views via local graph convolution, and pass messages across views via global attention. A view selector is further introduced to perform hierarchical graph coarsening, which progressively enlarges the receptive field of graph convolution and mitigates interference from redundant views, yielding a more discriminative hierarchical 3D representation. To enable category-agnostic alignment and mitigate overfitting to seen classes, we leverage CLIP text embeddings as semantic prototypes and project both sketch and 3D features into a shared semantic space. With the same model architecture, we use a two-stage training strategy for category-level retrieval and a one-stage strategy for zero-shot retrieval. Extensive experiments on two public benchmarks demonstrate that MV-HGNN outperforms state-of-the-art methods under both category-level and zero-shot settings.
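To make the pipeline described in the abstract concrete, the following is a minimal sketch of a hierarchical view-graph encoder followed by prototype matching. It is not the MV-HGNN implementation. Everything below is an assumption made for illustration: the ring-shaped camera layout, the weight-free mean aggregation and dot-product attention, the linear scoring vector used as a view selector, the induced-subgraph coarsening, the max-pool readout, and all function names. The prototypes are hand-written stand-ins for CLIP text embeddings, which the sketch does not compute.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def ring_adjacency(num_views):
    """Link each view to itself and its two neighbours on an assumed camera ring."""
    adj = [[0.0] * num_views for _ in range(num_views)]
    for i in range(num_views):
        for j in (i - 1, i, i + 1):
            adj[i][j % num_views] = 1.0
    return adj


def local_graph_conv(feats, adj):
    """Average each view's features over its graph neighbours (no learned weights)."""
    dim = len(feats[0])
    return [
        [sum(w * f[d] for w, f in zip(row, feats)) / sum(row) for d in range(dim)]
        for row in adj
    ]


def global_attention(feats):
    """Softmax dot-product attention over all views (no learned projections)."""
    dim = len(feats[0])
    out = []
    for query in feats:
        logits = [dot(query, key) / math.sqrt(dim) for key in feats]
        peak = max(logits)  # subtract the max for numerical stability
        weights = [math.exp(value - peak) for value in logits]
        total = sum(weights)
        out.append([sum(w * f[d] for w, f in zip(weights, feats)) / total
                    for d in range(dim)])
    return out


def select_views(feats, adj, score_vec, keep):
    """Keep the `keep` highest-scoring views and the sub-graph they induce."""
    scores = [dot(f, score_vec) for f in feats]
    ranked = sorted(range(len(feats)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:keep])
    sub_feats = [feats[i] for i in kept]
    sub_adj = [[adj[i][j] for j in kept] for i in kept]
    return sub_feats, sub_adj, kept


def encode_shape(view_feats, score_vec, keeps):
    """Message-pass and read out at every level, coarsening the graph in between."""
    feats, adj = view_feats, ring_adjacency(len(view_feats))
    dim = len(view_feats[0])
    descriptor = [0.0] * dim
    for keep in list(keeps) + [None]:  # None marks the final, coarsest level
        local = local_graph_conv(feats, adj)
        glob = global_attention(feats)
        feats = [[a + b for a, b in zip(l, g)] for l, g in zip(local, glob)]
        readout = [max(f[d] for f in feats) for d in range(dim)]  # max-pool over views
        descriptor = [x + r for x, r in zip(descriptor, readout)]
        if keep is not None:
            feats, adj, _ = select_views(feats, adj, score_vec, keep)
    return descriptor


def nearest_prototype(embedding, prototypes):
    """Return the label whose prototype has the highest cosine similarity."""
    def cosine(a, b):
        return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))
    return max(prototypes, key=lambda label: cosine(embedding, prototypes[label]))


# Toy usage: six 2-D view features, coarsened from 6 views to 3 and then to 1.
views = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9], [0.0, 1.0]]
descriptor = encode_shape(views, score_vec=[1.0, 0.0], keeps=[3, 1])
prototypes = {"chair": [1.0, 0.2], "lamp": [0.1, 1.0]}  # stand-ins for CLIP text embeddings
label = nearest_prototype(descriptor, prototypes)
```

In a trained model the aggregation, attention, and selector would carry learned parameters, and sketch features would be projected into the same space before matching. Because matching is done against text-derived prototypes rather than a fixed classifier head, new category names can in principle be added at test time, which is what the zero-shot setting relies on.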
