Existing video captioning benchmarks and models lack coherent representations of causal-temporal narrative, that is, sequences of events linked through cause and effect, unfolding over time and driven by characters or agents. This lack of narrative restricts models' ability to generate text descriptions that capture the causal and temporal dynamics inherent in video content. To address this gap, we propose NarrativeBridge, an approach comprising: (1) a novel Causal-Temporal Narrative (CTN) captions benchmark, generated using a large language model and few-shot prompting, that explicitly encodes cause-effect temporal relationships in video descriptions and is evaluated automatically to ensure caption quality and relevance; and (2) a dedicated Cause-Effect Network (CEN) architecture with separate encoders that capture cause and effect dynamics independently, enabling effective learning and generation of captions with causal-temporal narrative. Extensive experiments demonstrate that CEN is more accurate in articulating the causal and temporal aspects of video content than the second-best model (GIT): 17.88 and 17.44 CIDEr on the MSVD and MSR-VTT datasets, respectively. The proposed framework understands the intricate causal-temporal narrative structures present in videos and generates nuanced text descriptions of them, addressing a critical limitation in video captioning. For project details, visit https://narrativebridge.github.io/.