It is challenging to perform question-answering over complex, multimodal content such as television clips. This is in part because current video-language models rely on single-modality reasoning, perform worse on long inputs, and lack interpretability. We propose TV-TREES, the first multimodal entailment tree generator. TV-TREES serves as an approach to video understanding that promotes interpretable joint-modality reasoning by producing trees of entailment relationships between simple premises directly entailed by the videos and higher-level conclusions. We then introduce the task of multimodal entailment tree generation to evaluate the reasoning quality of such methods. Our method's experimental results on the challenging TVQA dataset demonstrate interpretable, state-of-the-art zero-shot performance on full video clips, illustrating a best-of-both-worlds contrast to black-box methods.