Everyday news coverage has shifted from traditional broadcasts towards a wide range of presentation formats such as first-hand, unedited video footage. Datasets that reflect the diverse array of multimodal, multilingual news sources available online could be used to teach models to benefit from this shift, but existing news video datasets focus on traditional news broadcasts produced for English-speaking audiences. We address this limitation by constructing MultiVENT, a dataset of multilingual, event-centric videos grounded in text documents across five target languages. MultiVENT includes both news broadcast videos and non-professional event footage, which we use to analyze the state of online news videos and how they can be leveraged to build robust, factually accurate models. Finally, we provide a model for complex, multilingual video retrieval to serve as a baseline for information retrieval using MultiVENT.