Question answering over complex, multimodal content such as television clips is challenging. This is in part because current video-language models rely on single-modality reasoning, perform worse on long inputs, and lack interpretability. We propose TV-TREES, the first multimodal entailment tree generator. TV-TREES is an approach to video understanding that promotes interpretable joint-modality reasoning by producing trees of entailment relationships between simple premises directly entailed by the videos and higher-level conclusions. We then introduce the task of multimodal entailment tree generation to evaluate the reasoning quality of such methods. Experimental results on the challenging TVQA dataset demonstrate interpretable, state-of-the-art zero-shot performance on full video clips, illustrating a best-of-both-worlds contrast to black-box methods.