Language has become a prominent modality in computer vision with the rise of LLMs. Despite supporting long context lengths, their effectiveness in handling long-term information gradually declines with input length. This is especially critical in applications such as long-form video understanding. In this paper, we introduce a Language Repository (LangRepo) for LLMs that maintains concise and structured information as an interpretable (i.e., all-textual) representation. Our repository is updated iteratively based on multi-scale video chunks. We introduce write operations that prune redundancies in text, and read operations that extract information at various temporal scales. The proposed framework is evaluated on zero-shot visual question-answering benchmarks, including EgoSchema, NExT-QA, IntentQA and NExT-GQA, showing state-of-the-art performance at its scale. Our code is available at https://github.com/kkahatapitiya/LangRepo.
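The iterative write/read cycle described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the string-based deduplication, the fixed chunk scales, and the join-based summarizer are all stand-in assumptions for the LLM-driven pruning and summarization in the actual framework.

```python
def dedup(entries):
    """Write-side pruning stand-in: drop exact-duplicate captions."""
    seen, kept = set(), []
    for e in entries:
        key = e.lower().strip()
        if key not in seen:
            seen.add(key)
            kept.append(e)
    return kept


class LangRepo:
    """Toy all-textual repository, updated iteratively per video chunk."""

    def __init__(self, scales=(2, 4)):
        self.scales = scales  # chunk sizes for multi-scale reads (assumed)
        self.entries = []     # interpretable, all-textual state

    def write(self, captions):
        """Ingest captions from one video chunk, pruning redundancies."""
        self.entries = dedup(self.entries + captions)

    def read(self):
        """Extract information at multiple temporal scales.

        The join below is a placeholder for an LLM summarization call.
        """
        out = {}
        for s in self.scales:
            chunks = [self.entries[i:i + s]
                      for i in range(0, len(self.entries), s)]
            out[s] = [" / ".join(c) for c in chunks]
        return out


repo = LangRepo()
repo.write(["person opens fridge", "person opens fridge", "person pours milk"])
repo.write(["person drinks coffee"])
print(repo.read()[2])
# → ['person opens fridge / person pours milk', 'person drinks coffee']
```

The key design point the sketch mirrors is that the repository state stays purely textual, so every intermediate representation remains human-readable.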