We present MM-Narrator, a novel system that leverages GPT-4 with multimodal in-context learning to generate audio descriptions (AD). Unlike previous methods, which primarily focused on downstream fine-tuning with short video clips, MM-Narrator generates precise audio descriptions for long videos, even those spanning hours, in an autoregressive manner. This capability is made possible by the proposed memory-augmented generation process, which uses both short-term textual context and long-term visual memory through an efficient register-and-recall mechanism. These contextual memories compile pertinent past information, including storylines and character identities, so that the generated audio descriptions track the story coherently and remain character-centric. Maintaining the training-free design of MM-Narrator, we further propose a complexity-based demonstration selection strategy that substantially enhances its multi-step reasoning capability via few-shot multimodal in-context learning (MM-ICL). Experimental results on the MAD-eval dataset demonstrate that MM-Narrator consistently outperforms both existing fine-tuning-based approaches and LLM-based approaches in most scenarios, as measured by standard evaluation metrics. Additionally, we introduce the first segment-based evaluator for recurrent text generation. Empowered by GPT-4, this evaluator comprehensively reasons about and scores AD generation performance along various extensible dimensions.
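As a rough illustration of complexity-based demonstration selection, the sketch below ranks a pool of candidate few-shot examples and keeps the most complex ones for the prompt. The abstract does not define how complexity is measured, so this sketch assumes it is the number of intermediate reasoning steps in a demonstration. The `Demonstration` class, its fields, and the `select_demonstrations` and `build_prompt` helpers are all hypothetical names introduced here, not part of MM-Narrator.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Demonstration:
    """A candidate few-shot example (hypothetical structure)."""
    context: str          # textual rendering of the multimodal clip inputs
    reasoning: List[str]  # intermediate reasoning steps
    answer: str           # the final audio description


def complexity(demo: Demonstration) -> int:
    # Assumption: complexity is the number of reasoning steps.
    return len(demo.reasoning)


def select_demonstrations(pool: List[Demonstration], k: int) -> List[Demonstration]:
    # Keep the k most complex demonstrations. sorted() is stable, so
    # demonstrations with equal complexity keep their original pool order.
    return sorted(pool, key=complexity, reverse=True)[:k]


def build_prompt(demos: List[Demonstration], query_context: str) -> str:
    # Concatenate the chosen demonstrations, then append the new query
    # with an empty reasoning slot for the model to complete.
    parts = []
    for demo in demos:
        steps = "\n".join(f"- {step}" for step in demo.reasoning)
        parts.append(
            f"Context: {demo.context}\nReasoning:\n{steps}\nAD: {demo.answer}"
        )
    parts.append(f"Context: {query_context}\nReasoning:")
    return "\n\n".join(parts)


pool = [
    Demonstration("clip A", ["identify the character"], "She waves."),
    Demonstration(
        "clip B",
        ["identify the character", "recall the prior scene", "describe the action"],
        "Anna walks back to the car.",
    ),
    Demonstration(
        "clip C",
        ["identify the character", "describe the action"],
        "He opens the door.",
    ),
]
chosen = select_demonstrations(pool, k=2)
prompt = build_prompt(chosen, "clip D")
```

Here the selection step needs no training, which is in keeping with the training-free design described above: only the contents of the prompt change.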