Multi-modal 3D scene understanding has gained considerable attention due to its wide range of applications in areas such as autonomous driving and human-computer interaction. Compared to conventional single-modal 3D understanding, introducing an additional modality not only enriches scene interpretation and makes it more precise but also makes the resulting understanding more robust. This becomes especially crucial in varied and challenging environments where relying solely on 3D data might be inadequate. While there has been a surge in the development of multi-modal 3D methods over the past three years, especially those integrating multi-camera images (3D+2D) and textual descriptions (3D+language), a comprehensive and in-depth review is notably absent. In this article, we present a systematic survey of recent progress to bridge this gap. We begin with a brief background that formally defines various 3D multi-modal tasks and summarizes their inherent challenges. We then present a novel taxonomy that categorizes existing methods according to modalities and tasks and explores their respective strengths and limitations. Furthermore, we offer comparative results of recent approaches on several benchmark datasets, together with insightful analysis. Finally, we discuss unresolved issues and outline several potential avenues for future research.