Egocentric memory is widely used in embodied intelligence, but it may be insufficient for comprehensive spatio-temporal reasoning. Inspired by the way humans recall events from both field and observer perspectives, we introduce EgoExoMem, the first benchmark for cross-view memory reasoning over synchronized egocentric and exocentric videos. EgoExoMem contains $2.6K$ high-quality multiple-choice questions (MCQs) across eight temporal, spatial, and cross-view QA types. To support dual-view retrieval, we propose E$^2$-Select, a training-free frame selection method for synchronized ego-exo videos. It combines relevance-based budget allocation with per-view k-DPP sampling to handle view asymmetry and cross-view temporal consistency. Experiments show that ego and exo views provide complementary memory cues, while existing multimodal large language models (MLLMs) remain far from solving the benchmark: the best model reaches only $55.3\%$. E$^2$-Select reaches $58.2\%$, a state-of-the-art result that outperforms frame-selection and RAG-based memory baselines. Further analysis reveals systematic view-preference conflicts between question framing and answer grounding, underscoring the novelty and challenge of cross-view memory reasoning.
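To make the two ingredients of E$^2$-Select concrete, the following is a minimal sketch of relevance-based budget allocation followed by per-view diverse selection. It is an illustration under stated assumptions, not the method's actual implementation: the function names (`allocate_budget`, `greedy_kdpp`, `select_frames`), the proportional split by mean relevance, the `min_per_view` floor, and the relevance-weighted dot-product kernel are all hypothetical, and a greedy determinant-maximizing selection stands in for true k-DPP sampling. Frame features are assumed to be unit-normalized embeddings and relevance scores are assumed to be positive.

```python
import math


def allocate_budget(rel_ego, rel_exo, total, min_per_view=1):
    """Split `total` frames between views in proportion to mean relevance.

    Assumption: the proportional rule and the per-view floor are illustrative.
    """
    mean_ego = sum(rel_ego) / len(rel_ego)
    mean_exo = sum(rel_exo) / len(rel_exo)
    share_ego = mean_ego / (mean_ego + mean_exo)
    k_ego = round(total * share_ego)
    # Keep at least `min_per_view` frames for each view.
    k_ego = max(min_per_view, min(total - min_per_view, k_ego))
    return k_ego, total - k_ego


def greedy_kdpp(features, relevance, k, eps=1e-9):
    """Greedily pick up to k frames that maximize the DPP kernel determinant.

    Kernel (assumed): L[i][j] = relevance[i] * relevance[j] * <f_i, f_j>.
    This is a deterministic stand-in for sampling from a k-DPP.
    """
    n = len(features)

    def kernel(i, j):
        dot = sum(a * b for a, b in zip(features[i], features[j]))
        return relevance[i] * relevance[j] * dot

    gain = [kernel(i, i) for i in range(n)]  # marginal determinant gain
    chol = [[] for _ in range(n)]            # incremental Cholesky rows
    selected = []
    while len(selected) < min(k, n):
        remaining = [i for i in range(n) if i not in selected]
        j = max(remaining, key=lambda i: gain[i])
        if gain[j] <= eps:
            break  # every remaining frame is redundant with the selection
        selected.append(j)
        root = math.sqrt(gain[j])
        for i in remaining:
            if i == j:
                continue
            proj = sum(a * b for a, b in zip(chol[j], chol[i]))
            e = (kernel(j, i) - proj) / root
            chol[i].append(e)
            gain[i] -= e * e
    return sorted(selected)  # temporal order


def select_frames(ego, exo, total):
    """Allocate the budget across views, then select frames within each view."""
    k_ego, k_exo = allocate_budget(ego["relevance"], exo["relevance"], total)
    return {
        "ego": greedy_kdpp(ego["features"], ego["relevance"], k_ego),
        "exo": greedy_kdpp(exo["features"], exo["relevance"], k_exo),
    }


if __name__ == "__main__":
    ego = {"features": [[1, 0], [1, 0], [0, 1]], "relevance": [1.0, 0.9, 0.5]}
    exo = {"features": [[1, 0], [0, 1], [1, 0]], "relevance": [0.2, 0.2, 0.2]}
    print(select_frames(ego, exo, total=4))
```

In this toy input the ego view receives most of the budget because its frames are more relevant to the query, and the near-duplicate ego frame is skipped because it adds no determinant gain. A view may therefore return fewer frames than its allocation; a fuller implementation would need to decide how to redistribute that unused budget.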