Videos inherently contain multiple modalities, including visual events, text overlays, sounds, and speech, all of which are important for retrieval. However, state-of-the-art multimodal language models like VAST and LanguageBind are built on vision-language models (VLMs), and thus overly prioritize visual signals. Retrieval benchmarks further reinforce this bias by focusing on visual queries and neglecting other modalities. We present MMMORRF, a search system that extracts text and features from both the visual and audio modalities and integrates them with a novel modality-aware weighted reciprocal rank fusion. MMMORRF is both effective and efficient, demonstrating its practicality for searching videos based on users' information needs rather than visually descriptive queries. We evaluate MMMORRF on MultiVENT 2.0 and TVR, two multimodal benchmarks designed for more targeted information needs, and find that it improves nDCG@20 by 81% over leading multimodal encoders and 37% over single-modality retrieval, demonstrating the value of integrating diverse modalities.
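To make the fusion step concrete, the following is a minimal sketch of weighted reciprocal rank fusion, where each modality's ranked list contributes 1/(k + rank) scaled by a per-modality weight. The abstract does not specify MMMORRF's exact weighting scheme, so the function name, the modality labels ("ocr", "asr", "visual"), and the example weights below are illustrative assumptions, not the paper's implementation.

```python
from collections import defaultdict

def weighted_rrf(rankings, weights, k=60):
    """Fuse per-modality rankings with modality-specific weights (illustrative sketch).

    rankings: dict mapping modality name -> ordered list of doc ids (best first)
    weights:  dict mapping modality name -> non-negative fusion weight
    k:        smoothing constant from standard reciprocal rank fusion
    """
    scores = defaultdict(float)
    for modality, ranked_docs in rankings.items():
        w = weights.get(modality, 1.0)  # hypothetical default weight
        for rank, doc_id in enumerate(ranked_docs, start=1):
            scores[doc_id] += w / (k + rank)
    # Return documents sorted by fused score, highest first
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Example usage with hypothetical per-modality rankings and weights
fused = weighted_rrf(
    rankings={
        "ocr":    ["vid3", "vid1", "vid7"],
        "asr":    ["vid1", "vid3", "vid9"],
        "visual": ["vid7", "vid1", "vid3"],
    },
    weights={"ocr": 1.0, "asr": 1.0, "visual": 0.5},
)
print(fused)
```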