AI glasses present a compelling platform for AI agents to serve as personalized memory assistants. To be genuinely useful, such systems must move beyond short-term video comprehension and address the memory gaps that people experience for practical, personal, or social purposes, drawing on longitudinal egocentric video streams. However, existing egocentric datasets predominantly focus on action recognition or generic question answering over short clips, measuring perceptual capabilities rather than realistic human memory needs. We introduce SuperMemory-VQA, an egocentric visual question answering (VQA) dataset for evaluating AI assistants on practical, long-horizon memory tasks. It contains 52.9 hours of everyday activities recorded with AI glasses, including synchronized RGB video, audio transcription, eye gaze, IMU, and SLAM trajectories. Through a human-verified annotation pipeline, we construct 4,853 grounded question-answer pairs that span object and location memory, intent recall, visual scene recall, timeline reconstruction, conversational memory, and in-context retrieval. Each question is posed as multiple-choice with an explicit "unanswerable" option to test hallucination robustness. Benchmarking leading agentic frameworks and LLM backbones reveals that existing systems remain far from reliable on real-world memory tasks, highlighting the need for new grounded AI memory architectures that answer only when evidence is sufficient. A participant survey further supports that our questions are realistic, useful, and aligned with everyday memory needs.