This paper presents a novel hierarchical alignment model (HAM) that learns multi-granularity visual and linguistic representations in an end-to-end manner. We extract key points and proposal points to model 3D contexts and instances, and propose a point-language alignment with context modulation (PLACM) mechanism, which learns to gradually align word-level and sentence-level linguistic embeddings with visual representations, while modulation with the visual context captures latent informative relationships. To capture both global and local relationships, we further propose a spatially multi-granular modeling scheme that applies PLACM to both global and local fields. Experimental results demonstrate the superiority of HAM, and visualizations show that it can dynamically model fine-grained visual and linguistic representations. HAM outperforms existing methods by a significant margin, achieves state-of-the-art performance on two publicly available datasets, and won the championship in the ECCV 2022 ScanRefer challenge. Code is available at~\url{https://github.com/PPjmchen/HAM}.