The goal of video summarization is to automatically shorten a video so that it conveys the overall story without losing relevant information. In many application scenarios, improper video summarization can have a large impact. For example, in forensics, the quality of the generated summary affects an investigator's judgment, while in journalism a poor summary might introduce undesired bias. Because of this, modeling explainability is a key concern. One of the best ways to address the explainability challenge is to uncover the causal relations that steer the summarization process and lead to the result. Current machine learning-based video summarization algorithms learn optimal parameters but do not uncover causal relationships, and hence suffer from a relative lack of explainability. In this work, a Causal Explainer, dubbed Causalainer, is proposed to address this issue. Multiple meaningful random variables and their joint distributions are introduced to characterize the behaviors of key components in the video summarization problem. In addition, helper distributions are introduced to make model training more effective. In visual-textual input scenarios, the extra input can decrease model performance. A causal semantics extractor is designed to tackle this issue by effectively distilling the mutual information from the visual and textual inputs. Experimental results on commonly used benchmarks demonstrate that the proposed method achieves state-of-the-art performance while being more explainable.