Causal Video Question Answering (CVidQA) queries not only association or temporal relations but also causal relations in a video. Existing question synthesis methods pre-train question generation (QG) systems on reading comprehension datasets with text descriptions as inputs. However, such QG models only learn to ask association questions (e.g., ``what is someone doing...'') and perform poorly because association knowledge transfers poorly to CVidQA, which focuses on causal questions like ``why is someone doing ...''. Observing this, we propose to exploit causal knowledge to generate question-answer pairs, and introduce a novel framework, Causal Knowledge Extraction from Language Models (CaKE-LM), which leverages causal commonsense knowledge from language models (LMs) to tackle CVidQA. To extract knowledge from LMs, CaKE-LM generates causal questions containing two events, one triggering the other (e.g., ``score a goal'' triggers ``soccer player kicking ball''), by prompting the LM with the action (soccer player kicking ball) to retrieve the intention (to score a goal). CaKE-LM significantly outperforms conventional methods by 4% to 6% in zero-shot CVidQA accuracy on the NExT-QA and Causal-VidQA datasets. We also conduct comprehensive analyses and provide key findings for future research.
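The action-to-intention prompting step described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the prompt template, the `CausalQA` container, the question wording, and the `toy_lm` stand-in (a fixed lookup used in place of a real language model) do not come from the paper, which does not specify these details here.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CausalQA:
    question: str
    answer: str


def build_prompt(action: str) -> str:
    # Hypothetical prompt template; the abstract does not give the exact wording.
    return f"{action} because they want to"


def generate_causal_qa(action: str, lm: Callable[[str], str]) -> CausalQA:
    """Build a 'why' question about an action, answered by the intention the LM returns."""
    intention = lm(build_prompt(action)).strip().rstrip(".")
    # Drop a leading "to " so the answer is not rendered as "To to ...".
    if intention.lower().startswith("to "):
        intention = intention[3:]
    return CausalQA(question=f"Why is the {action}?", answer=f"To {intention}.")


def toy_lm(prompt: str) -> str:
    # Stand-in for a real language model: a fixed lookup, for illustration only.
    canned = {"soccer player kicking ball because they want to": "score a goal"}
    return canned.get(prompt, "achieve some goal")


qa = generate_causal_qa("soccer player kicking ball", toy_lm)
print(qa.question)
print(qa.answer)
```

Passing the LM in as a plain callable keeps the sketch independent of any particular model API; a real model call could be substituted for `toy_lm` without changing `generate_causal_qa`.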