Foundation models are used for many real-world applications involving language generation from temporally ordered multimodal events. In this work, we study the ability of models to identify the most important sub-events in a video, a fundamental prerequisite for narrating or summarizing multimodal events. Specifically, we focus on football games and evaluate models on their ability to distinguish between important and unimportant sub-events in a game. To this end, we construct a new dataset by leveraging the human preferences for importance implicit in football game highlight reels, without any additional annotation costs. Using our dataset, which we will publicly release to the community, we compare several state-of-the-art multimodal models and show that their performance is close to chance level. Analyses beyond standard evaluation metrics reveal the models' tendency to rely on a single dominant modality and their ineffectiveness in synthesizing necessary information from multiple sources. Our findings underline the importance of modular architectures that can handle sample-level heterogeneity in multimodal data, and the need for complementary training procedures that maximize cross-modal synergy.