In this work, we propose the use of "aligned visual captions" as a mechanism for integrating information contained within videos into retrieval-augmented generation (RAG) based chat assistant systems. These captions describe the visual and audio content of videos in a large corpus in a textual format that is easy to reason about and to incorporate into large language model (LLM) prompts. They also typically require less multimedia content to be inserted into the multimodal LLM context window, which typical configurations can fill quickly by sampling frames from the source video. Furthermore, visual captions can be adapted to specific use cases by prompting the original foundation model or captioner for particular visual details, or by fine-tuning it. To help advance progress in this area, we curate a dataset and describe automatic evaluation procedures for common RAG tasks.
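To illustrate the idea of placing captions rather than sampled frames into the LLM context, the following is a minimal sketch and not the system described in this work. It assumes a hypothetical `CaptionChunk` record (a caption tied to a time span of one video), uses a toy word-overlap score in place of a real retriever, and assembles a text-only prompt. All names, fields, and the prompt wording are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CaptionChunk:
    """Hypothetical record: a caption aligned to a time span of one video."""
    video_id: str
    start_s: float
    end_s: float
    text: str


def score(query: str, chunk: CaptionChunk) -> int:
    """Toy relevance: number of distinct query words found in the caption.

    A stand-in for a real retriever (e.g. embedding similarity).
    """
    query_words = set(query.lower().split())
    caption_words = set(chunk.text.lower().split())
    return len(query_words & caption_words)


def retrieve(query: str, chunks: List[CaptionChunk], k: int = 2) -> List[CaptionChunk]:
    """Return up to k chunks with a non-zero score, best first."""
    ranked = sorted(chunks, key=lambda c: score(query, c), reverse=True)
    return [c for c in ranked[:k] if score(query, c) > 0]


def build_prompt(query: str, chunks: List[CaptionChunk]) -> str:
    """Assemble a text-only prompt; no video frames enter the context."""
    context = "\n".join(
        f"[{c.video_id} {c.start_s:.0f}-{c.end_s:.0f}s] {c.text}" for c in chunks
    )
    return (
        "Answer using only the video captions below.\n\n"
        f"{context}\n\nQuestion: {query}"
    )
```

Because each retrieved caption carries its video identifier and time span, an answer produced from such a prompt can in principle be traced back to a specific segment of the source video.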