Conventional frame-based cameras capture rich contextual information but suffer from limited temporal resolution and motion blur in dynamic scenes. Event cameras offer an alternative visual representation that has a higher dynamic range and is free from such limitations. The complementary characteristics of the two modalities make event-frame asymmetric stereo promising for reliable 3D perception under fast motion and challenging illumination. However, the modality gap often causes the domain-specific cues that are essential for cross-modal stereo matching to be marginalized. In this paper, we introduce Bi-CMPStereo, a novel bidirectional cross-modal prompting framework that exploits semantic and structural features from both domains for robust matching. Our approach learns finely aligned stereo representations within a target canonical space and integrates complementary representations by projecting each modality into both the event and frame domains. Extensive experiments demonstrate that our approach significantly outperforms state-of-the-art methods in both accuracy and generalization.