In this paper we address the task of summarizing television shows, which touches key areas in AI research: complex reasoning, multiple modalities, and long narratives. We present a modular approach in which separate components perform specialized sub-tasks, which we argue affords greater flexibility than end-to-end methods. Our modules detect scene boundaries, reorder scenes so as to minimize the number of cuts between different events, convert visual information to text, summarize the dialogue in each scene, and fuse the scene summaries into a final summary for the entire episode. We also present a new metric, PREFS (Precision and Recall Evaluation of Summary FactS), which measures both the precision and the recall of generated summaries by decomposing them into atomic facts. Tested on the recently released SummScreen3D dataset (Papalampidi and Lapata, 2023), our method produces higher-quality summaries than comparison models, as measured with ROUGE and our new fact-based metric.
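To make the idea of fact-level precision and recall concrete, the following is a minimal sketch and not the PREFS implementation itself. It assumes that both the generated and the reference summary have already been decomposed into atomic facts, and that some judge decides whether a fact is supported by a set of other facts. The names `fact_precision_recall` and `is_supported` are hypothetical and introduced only for illustration; how facts are extracted and judged is left open here.

```python
from typing import Callable, Sequence


def fact_precision_recall(
    generated_facts: Sequence[str],
    reference_facts: Sequence[str],
    is_supported: Callable[[str, Sequence[str]], bool],
) -> tuple[float, float]:
    """Return (precision, recall) over atomic facts.

    Assumption: `is_supported(fact, evidence)` is a caller-supplied judge
    (hypothetical) that says whether `fact` is backed by the `evidence` facts.
    """

    def supported_fraction(facts: Sequence[str], evidence: Sequence[str]) -> float:
        # An empty fact list yields 0.0 rather than a division by zero.
        if not facts:
            return 0.0
        return sum(is_supported(fact, evidence) for fact in facts) / len(facts)

    # Precision: share of generated facts that the reference supports.
    precision = supported_fraction(generated_facts, reference_facts)
    # Recall: share of reference facts that the generated summary supports.
    recall = supported_fraction(reference_facts, generated_facts)
    return precision, recall
```

With a trivial exact-match judge such as `lambda fact, evidence: fact in evidence`, a generated summary that adds unsupported facts loses precision, while one that omits reference facts loses recall. In practice the judge would need to handle paraphrase, which exact matching does not.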