In this paper we introduce LifelongMemory, a new framework for accessing long-form egocentric videographic memory through natural language question answering and retrieval. LifelongMemory generates concise descriptions of the camera wearer's activities in the video and leverages the zero-shot capabilities of pretrained large language models to perform reasoning over long-form video context. Furthermore, LifelongMemory uses a confidence and explanation module to produce high-confidence, high-quality, and interpretable answers. Our approach achieves state-of-the-art performance on the EgoSchema benchmark for question answering and is highly competitive on the natural language query (NLQ) challenge of Ego4D. Code is available at https://github.com/Agentic-Learning-AI-Lab/lifelong-memory.