The goal in episodic memory (EM) is to search a long egocentric video to answer a natural language query (e.g., "where did I leave my purse?"). Existing EM methods exhaustively extract expensive fixed-length clip features to look everywhere in the video for the answer, which is infeasible for long wearable-camera videos that span hours or even days. We propose SpotEM, an approach to achieve efficiency for a given EM method while maintaining good accuracy. SpotEM consists of three key ideas: 1) a novel clip selector that learns to identify promising video regions to search conditioned on the language query; 2) a set of low-cost semantic indexing features that capture the context of rooms, objects, and interactions that suggest where to look; and 3) distillation losses that address the optimization issues arising from end-to-end joint training of the clip selector and EM model. Our experiments on 200+ hours of video from the Ego4D EM Natural Language Queries benchmark and three different EM models demonstrate the effectiveness of our approach: computing only 10% to 25% of the clip features, we preserve 84% to 97% of the original EM model's accuracy. Project page: https://vision.cs.utexas.edu/projects/spotem
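The budgeted-computation idea in the abstract can be illustrated with a minimal sketch. It assumes that some cheap selector has already produced one relevance score per clip; the names `select_clips`, `extract_selected`, `scores`, `budget`, and `expensive_fn` are hypothetical and do not come from the paper. A static top-k rule stands in for SpotEM's learned, query-conditioned clip selector, so this shows only how a 10% to 25% feature budget limits expensive extraction, not the method itself.

```python
from typing import Callable, Dict, List, Sequence, TypeVar

Clip = TypeVar("Clip")
Feature = TypeVar("Feature")


def select_clips(scores: Sequence[float], budget: float) -> List[int]:
    """Return indices of the highest-scoring clips, in temporal order.

    `scores` are hypothetical per-clip relevance scores from a cheap selector.
    `budget` is the fraction of clips allowed to receive expensive features.
    """
    if not 0.0 < budget <= 1.0:
        raise ValueError("budget must be in (0, 1]")
    # Always keep at least one clip so the downstream EM model has an input.
    k = max(1, int(len(scores) * budget))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def extract_selected(
    clips: Sequence[Clip],
    scores: Sequence[float],
    budget: float,
    expensive_fn: Callable[[Clip], Feature],
) -> Dict[int, Feature]:
    """Run the expensive feature extractor only on the selected clips."""
    return {i: expensive_fn(clips[i]) for i in select_clips(scores, budget)}
```

With eight clips and a budget of 0.25, only the two highest-scoring clips are passed to `expensive_fn`; the remaining six are skipped entirely, which is where the savings would come from.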