Robust Ego-Exo Correspondence with Long-Term Memory

Establishing object-level correspondence between egocentric and exocentric views is essential for intelligent assistants to deliver precise and intuitive visual guidance. However, the task is difficult because of extreme viewpoint variations, occlusions, and small objects. Existing approaches usually adapt video object segmentation models, but they remain vulnerable to these challenges. Recently, the Segment Anything Model 2 (SAM 2) has shown strong generalization and excellent performance in video object segmentation. Yet when applied directly to the ego-exo correspondence (EEC) task, SAM 2 struggles because of ineffective ego-exo feature fusion and limited long-term memory capacity, especially on long videos. To address these problems, we propose a novel SAM 2-based EEC framework with long-term memory, built on a dual-memory architecture and an adaptive feature routing module inspired by Mixture-of-Experts (MoE). Compared to SAM 2, our approach features (i) a Memory-View MoE module with a dual-branch routing mechanism that adaptively assigns contribution weights to each expert feature along both the channel and spatial dimensions, and (ii) a dual-memory bank system with a simple yet effective compression strategy that retains critical long-term information while eliminating redundancy. In extensive experiments on the challenging EgoExo4D benchmark, our method, dubbed LM-EEC, achieves new state-of-the-art results and significantly outperforms existing methods and the SAM 2 baseline, showcasing strong generalization across diverse scenarios. Our code and model are available at https://github.com/juneyeeHu/LM-EEC.
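
To make the idea of routing expert features along both channel and spatial dimensions concrete, the following is a minimal NumPy sketch, not the LM-EEC implementation. The abstract does not specify how the two branches compute or combine their weights, so everything here is an assumption for illustration: the function name `route_and_fuse`, the hypothetical projection matrix `channel_proj`, the use of pooled activations as routing logits, and the choice to average the two branch outputs.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def route_and_fuse(experts, channel_proj):
    """Fuse expert feature maps with channel-wise and spatial routing.

    experts:      array of shape (E, C, H, W), one feature map per expert
                  (e.g. a memory-conditioned and a view-conditioned feature).
    channel_proj: array of shape (C, C); a hypothetical projection that
                  turns pooled descriptors into channel routing logits.
    """
    # Channel branch (assumed design): global-average-pool each expert to a
    # (C,) descriptor, project it, then normalise across experts per channel.
    pooled = experts.mean(axis=(2, 3))               # (E, C)
    channel_logits = pooled @ channel_proj.T         # (E, C)
    w_channel = softmax(channel_logits, axis=0)      # sums to 1 over experts

    # Spatial branch (assumed design): average over channels to get one
    # logit per location, then normalise across experts per location.
    spatial_logits = experts.mean(axis=1)            # (E, H, W)
    w_spatial = softmax(spatial_logits, axis=0)      # sums to 1 over experts

    # Each branch produces a convex combination of the expert features.
    fused_channel = (w_channel[:, :, None, None] * experts).sum(axis=0)
    fused_spatial = (w_spatial[:, None, :, :] * experts).sum(axis=0)

    # Averaging the two branches is an illustrative choice, not the paper's.
    fused = 0.5 * (fused_channel + fused_spatial)    # (C, H, W)
    return fused, w_channel, w_spatial


# Usage with random stand-in features: 2 experts, 8 channels, 4x4 maps.
rng = np.random.default_rng(0)
experts = rng.standard_normal((2, 8, 4, 4))
channel_proj = 0.1 * rng.standard_normal((8, 8))
fused, w_channel, w_spatial = route_and_fuse(experts, channel_proj)
```

Because both sets of weights are normalised across experts, every element of the fused map in this sketch lies between the smallest and largest expert value at that position, which keeps the fusion a soft selection rather than an unconstrained mix.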
