Existing video captioning benchmarks and models lack a coherent representation of causal-temporal narrative, that is, sequences of events linked through cause and effect, unfolding over time and driven by characters or agents. This gap restricts models' ability to generate text descriptions that capture the causal and temporal dynamics inherent in video content. To address it, we propose NarrativeBridge, an approach comprising: (1) a novel Causal-Temporal Narrative (CTN) captions benchmark, generated using a large language model with few-shot prompting, that explicitly encodes cause-effect temporal relationships in video descriptions, with caption quality and relevance checked by automatic evaluation and validated through human evaluation; and (2) a dedicated Cause-Effect Network (CEN) architecture with separate encoders that capture cause and effect dynamics independently, enabling effective learning and generation of captions with causal-temporal narrative. Extensive experiments show that CEN significantly outperforms state-of-the-art models, including fine-tuned vision-language models, and is more accurate in articulating the causal and temporal aspects of video content than the second-best model (GIT): 17.88 and 17.44 CIDEr on the MSVD and MSR-VTT datasets, respectively. Cross-dataset evaluations further demonstrate CEN's strong generalization capabilities. The proposed framework understands and generates nuanced text descriptions that reflect the intricate causal-temporal narrative structures present in videos, addressing a critical limitation in video captioning.
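As a rough illustration of how few-shot prompting could turn a video's existing captions into a cause/effect pair, the sketch below assembles a prompt from hand-written demonstrations and parses a model reply. The abstract does not specify the prompt wording, the demonstration format, or the output schema, so `FewShotExample`, `INSTRUCTION`, `build_ctn_prompt`, and `parse_ctn_reply` are all hypothetical names and formats chosen for this example, not the authors' implementation. The call to the language model itself is left out.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class FewShotExample:
    """One hand-written demonstration (hypothetical format)."""
    captions: Sequence[str]  # existing descriptive captions for a video
    cause: str               # what triggers the event
    effect: str              # what happens as a result


# Assumed instruction text; the real prompt wording is not given in the source.
INSTRUCTION = (
    "Given the captions of a video, write one Cause sentence and one "
    "Effect sentence describing its causal-temporal narrative."
)


def format_captions(captions: Sequence[str]) -> str:
    """Render captions as a bulleted list, one caption per line."""
    return "\n".join(f"- {c}" for c in captions)


def build_ctn_prompt(examples: Sequence[FewShotExample],
                     captions: Sequence[str]) -> str:
    """Concatenate the instruction, the demonstrations, and the query."""
    blocks = [INSTRUCTION]
    for ex in examples:
        blocks.append(
            f"Captions:\n{format_captions(ex.captions)}\n"
            f"Cause: {ex.cause}\nEffect: {ex.effect}"
        )
    # The query block ends with an open "Cause:" slot for the model to fill.
    blocks.append(f"Captions:\n{format_captions(captions)}\nCause:")
    return "\n\n".join(blocks)


def parse_ctn_reply(reply: str) -> dict:
    """Split a completion of the form '<cause>\\nEffect: <effect>'."""
    cause, sep, effect = reply.partition("Effect:")
    if not sep:
        raise ValueError("reply has no 'Effect:' field")
    cause = cause.strip()
    # Tolerate a model that repeats the "Cause:" label in its reply.
    if cause.startswith("Cause:"):
        cause = cause[len("Cause:"):].strip()
    return {"cause": cause, "effect": effect.strip()}
```

Keeping cause and effect as separate fields in the parsed output mirrors the separation the abstract describes for the CEN encoders, although how the two parts are actually fed to the network is not stated here.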