Our objective is to build an embedding model that captures the nuanced relationship between a search query and candidate videos. We cover three aspects of nuanced retrieval: (i) temporal, (ii) negation, and (iii) multimodal. For temporal nuance, we consider chiral actions, which require distinguishing between temporally opposite actions such as "opening a door" vs. "closing a door". For negation, we consider queries with negators such as "not" and "none" that allow users to specify what they do not want. For multimodal nuance, we consider the task of composed retrieval, where the query comprises a video along with a text edit instruction. The goal is to develop a unified embedding model that handles such nuances effectively. To that end, we repurpose a Multimodal Large Language Model (MLLM) trained to generate text into an embedding model. We fine-tune it with a contrastive loss on text alone, using carefully sampled hard negatives that instill the desired nuances in the learned embedding space. Despite the text-only training, our method achieves state-of-the-art performance on all benchmarks for nuanced video retrieval. We also analyze how this improvement is achieved and show that text-only training reduces the modality gap between text and video embeddings, leading to better organization of the embedding space.
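To make the idea of a contrastive loss with hard negatives concrete, the following is a minimal sketch and not the exact training objective, which the abstract does not specify. It assumes an InfoNCE-style loss over L2-normalized embeddings, with in-batch negatives plus K explicit hard negatives per anchor (for example, the embedding of "closing a door" for the anchor "opening a door"). The function names, the temperature value, and the toy embeddings are all illustrative assumptions.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale vectors along the last axis to unit L2 norm."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def contrastive_loss_with_hard_negatives(anchors, positives, hard_negatives,
                                         temperature=0.07):
    """InfoNCE-style loss (assumed form, for illustration only).

    anchors:        (B, D) embeddings of anchor texts
    positives:      (B, D) embeddings of matching texts
    hard_negatives: (B, K, D) embeddings of near-miss texts per anchor
    """
    a = l2_normalize(anchors)
    p = l2_normalize(positives)
    n = l2_normalize(hard_negatives)

    # Cosine similarity of each anchor to every positive in the batch.
    # The diagonal holds the true pairs; off-diagonals act as in-batch negatives.
    in_batch = a @ p.T                          # (B, B)
    # Cosine similarity of each anchor to its own hard negatives.
    hard = np.einsum("bd,bkd->bk", a, n)        # (B, K)

    logits = np.concatenate([in_batch, hard], axis=1) / temperature
    # Numerically stable log-softmax over all candidates of each anchor.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    idx = np.arange(a.shape[0])
    return float(-log_prob[idx, idx].mean())


# Toy data: four mutually orthogonal anchors in a 16-dimensional space.
toy_anchors = np.eye(4, 16)
toy_positives = toy_anchors.copy()
# Hard negatives far from the anchor (easy case) vs. identical to it (hard case).
far_negatives = -toy_anchors[:, None, :]
near_negatives = toy_anchors[:, None, :]

loss_far = contrastive_loss_with_hard_negatives(toy_anchors, toy_positives, far_negatives)
loss_near = contrastive_loss_with_hard_negatives(toy_anchors, toy_positives, near_negatives)
```

In this sketch, a hard negative that sits close to the anchor raises the loss, so minimizing it pushes near-miss texts away from the anchor. That is the mechanism by which sampled hard negatives can shape the embedding space.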
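The abstract does not state how the modality gap is measured. One common proxy, used here purely as an assumption, is the Euclidean distance between the centroids of the L2-normalized text embeddings and the L2-normalized video embeddings. The sketch below uses a hypothetical `modality_gap` helper and toy arrays in place of real model outputs.

```python
import numpy as np


def modality_gap(text_emb, video_emb, eps=1e-12):
    """Distance between modality centroids (assumed proxy, not the paper's metric).

    text_emb:  (N, D) text embeddings
    video_emb: (M, D) video embeddings
    """
    t = text_emb / np.maximum(np.linalg.norm(text_emb, axis=1, keepdims=True), eps)
    v = video_emb / np.maximum(np.linalg.norm(video_emb, axis=1, keepdims=True), eps)
    # Compare the mean direction of each modality on the unit sphere.
    return float(np.linalg.norm(t.mean(axis=0) - v.mean(axis=0)))


# Toy embeddings: each modality occupies its own direction in a 2-D space.
toy_text = np.tile(np.array([1.0, 0.0]), (3, 1))
toy_video = np.tile(np.array([0.0, 1.0]), (3, 1))

gap_separated = modality_gap(toy_text, toy_video)
gap_identical = modality_gap(toy_text, toy_text)
```

Under this proxy, a smaller value after fine-tuning would indicate that text and video embeddings occupy more similar regions of the space.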