Recent approaches have shown impressive proficiency in extracting and leveraging parametric knowledge from Large Language Models (LLMs) and Vision-Language Models (VLMs). In this work, we consider how to improve the identification and retrieval of videos related to complex real-world events by automatically extracting latent parametric knowledge about those events. We present Q2E: a Query-to-Event decomposition method for zero-shot multilingual text-to-video retrieval that is adaptable across datasets, domains, LLMs, and VLMs. Our approach demonstrates that otherwise overly simplified human queries can be better understood by decomposing them using the knowledge embedded in LLMs and VLMs. We additionally show how to apply our approach to both visual and speech-based inputs. To combine this varied multimodal knowledge, we adopt entropy-based fusion scoring for zero-shot fusion. Through evaluations on two diverse datasets and multiple retrieval metrics, we demonstrate that Q2E outperforms several state-of-the-art baselines. Our evaluation also shows that integrating audio information can significantly improve text-to-video retrieval. We have released code and data for future research.
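As a minimal sketch of the entropy-based fusion idea mentioned above: one plausible reading is that each modality (e.g., vision and speech) produces a score vector over candidate videos, and modalities whose score distributions have lower entropy (i.e., are more peaked and confident) receive higher weight in the fused ranking. The function name `entropy_weighted_fusion` and the example scores below are hypothetical and do not reproduce the paper's exact formulation.

```python
# Illustrative sketch (not the paper's exact method): inverse-entropy weighting
# of per-modality query-to-video scores for zero-shot fusion.
import numpy as np

def entropy_weighted_fusion(modality_scores: dict[str, np.ndarray]) -> np.ndarray:
    """Fuse per-modality retrieval scores, weighting confident modalities more."""
    weights = {}
    for name, scores in modality_scores.items():
        # Turn raw similarity scores into a probability distribution (softmax).
        probs = np.exp(scores - scores.max())
        probs /= probs.sum()
        # Shannon entropy of the distribution: higher entropy = less confident.
        entropy = -np.sum(probs * np.log(probs + 1e-12))
        # Inverse-entropy weight so peaked (confident) modalities dominate.
        weights[name] = 1.0 / (entropy + 1e-12)
    total = sum(weights.values())
    fused = sum((weights[name] / total) * scores
                for name, scores in modality_scores.items())
    return fused

# Hypothetical usage: two modalities scoring five candidate videos.
scores = {
    "vision": np.array([0.9, 0.2, 0.1, 0.3, 0.2]),
    "speech": np.array([0.4, 0.5, 0.45, 0.5, 0.48]),
}
ranking = np.argsort(-entropy_weighted_fusion(scores))
print(ranking)  # candidate video indices, best match first
```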