In this paper, we investigate the open research task of cross-modal retrieval between 3D shapes and textual descriptions. Previous approaches mainly rely on point cloud encoders for feature extraction, which may ignore key inherent properties of 3D shapes, including depth, spatial hierarchy, and geometric continuity. To address this issue, we propose COM3D, making the first attempt to exploit cross-view correspondence and cross-modal mining to enhance retrieval performance. Specifically, we augment the 3D features through a scene representation transformer, generating cross-view correspondence features of 3D shapes that enrich the inherent features and enhance their compatibility with text matching. Furthermore, we optimize the cross-modal matching process with a semi-hard negative example mining method to improve learning efficiency. Extensive quantitative and qualitative experiments demonstrate the superiority of our proposed COM3D, which achieves state-of-the-art results on the Text2Shape dataset.
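To make the semi-hard negative mining step concrete, the following is a minimal generic sketch of FaceNet-style semi-hard selection for a triplet loss; it is an illustration of the general technique, not COM3D's actual implementation, and the function name, margin value, and use of Euclidean distance are assumptions.

```python
import numpy as np

def semi_hard_negative(anchor, positive, negatives, margin=0.2):
    """Pick a semi-hard negative for one (anchor, positive) pair.

    Semi-hard: farther from the anchor than the positive, but still
    inside the margin, i.e. d(a, p) < d(a, n) < d(a, p) + margin.
    Such negatives violate the triplet constraint without collapsing
    training the way the hardest negatives can.
    """
    d_ap = np.linalg.norm(anchor - positive)
    d_an = np.linalg.norm(negatives - anchor, axis=1)

    mask = (d_an > d_ap) & (d_an < d_ap + margin)
    if mask.any():
        # Among semi-hard candidates, take the closest (hardest) one.
        valid = np.where(mask)[0]
        return int(valid[np.argmin(d_an[valid])])
    # Fallback: no semi-hard negative in this batch; use the hardest.
    return int(np.argmin(d_an))

# Toy 2D example: one semi-hard, one easy, one too-hard negative.
anchor = np.array([0.0, 0.0])
positive = np.array([1.0, 0.0])            # d(a, p) = 1.0
negatives = np.array([[1.1, 0.0],          # d = 1.1 -> semi-hard
                      [3.0, 0.0],          # d = 3.0 -> easy
                      [0.5, 0.0]])         # d = 0.5 -> harder than positive

idx = semi_hard_negative(anchor, positive, negatives)
d_an = np.linalg.norm(negatives[idx] - anchor)
loss = max(0.0, np.linalg.norm(anchor - positive) - d_an + 0.2)
```

In this toy batch the first negative is selected (distance 1.1 lies between 1.0 and 1.2), yielding a small nonzero triplet loss, which is exactly the regime semi-hard mining targets.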