Existing Causal-Why Video Question Answering (VideoQA) models often struggle with higher-order reasoning, relying on opaque, monolithic pipelines that entangle video understanding, causal inference, and answer generation. These black-box approaches offer limited interpretability and tend to depend on shallow heuristics. We propose a novel, modular paradigm that explicitly decouples causal reasoning from answer generation, introducing natural language causal chains as interpretable intermediate representations. Inspired by human cognitive models, these structured cause-effect sequences bridge low-level video content with high-level causal reasoning, enabling transparent and logically coherent inference. Our two-stage architecture comprises a Causal Chain Extractor (CCE) that generates causal chains from video-question pairs, and a Causal Chain-Driven Answerer (CCDA) that derives answers grounded in these chains. To address the lack of annotated reasoning traces, we introduce a scalable method for generating accurate causal chains from existing datasets, and we construct human-verified causal chains for 46K samples. We also propose CauCo, a new evaluation metric for causality-oriented captioning. Experiments on three large-scale benchmarks demonstrate that our approach not only outperforms state-of-the-art models but also yields substantial gains in explainability, user trust, and generalization, positioning the CCE as a reusable causal reasoning engine across diverse domains. Project page: https://paritoshparmar.github.io/chainreaction/
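The sketch below illustrates the decoupled two-stage inference flow summarized above: a CCE maps a (video, question) pair to a natural-language causal chain, and a CCDA answers the question conditioned on that chain. It is a minimal, hypothetical interface sketch only; the class names, method signatures, and toy stub outputs are our own assumptions for illustration, whereas the actual CCE and CCDA are learned video-language models.

```python
# Hypothetical interface sketch of the two-stage pipeline (not the authors' code).
from dataclasses import dataclass
from typing import List


@dataclass
class CausalChain:
    # Ordered natural-language cause-effect steps (the interpretable intermediate representation).
    steps: List[str]

    def as_text(self) -> str:
        return " -> ".join(self.steps)


class CausalChainExtractor:
    """Stage 1 (CCE): extracts a causal chain from a video-question pair."""

    def extract(self, video_path: str, question: str) -> CausalChain:
        # Placeholder output; a trained CCE would ground each step in the video content.
        return CausalChain(steps=[
            "a pedestrian steps onto the road",
            "the driver notices the pedestrian",
            "the car brakes suddenly",
        ])


class CausalChainDrivenAnswerer:
    """Stage 2 (CCDA): derives an answer grounded in the causal chain."""

    def answer(self, question: str, chain: CausalChain) -> str:
        # Placeholder logic; a trained CCDA would reason over the chain text.
        return f"Because {chain.steps[0]}."


def answer_why_question(video_path: str, question: str) -> str:
    # Decoupled inference: causal reasoning first, answer generation second.
    chain = CausalChainExtractor().extract(video_path, question)
    return CausalChainDrivenAnswerer().answer(question, chain)


if __name__ == "__main__":
    print(answer_why_question("demo.mp4", "Why did the car brake suddenly?"))
```

Because the chain is plain natural language, the CCE stage can in principle be reused as a standalone causal reasoning engine and paired with different downstream answerers, which is the modularity the abstract emphasizes.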