Our objective is to build an embedding model that captures the nuanced relationship between a search query and candidate videos. We cover three aspects of nuanced retrieval: (i) temporal, (ii) negation, and (iii) multimodal. For temporal nuance, we consider chiral actions that require distinguishing between temporally opposite actions, such as "opening a door" vs. "closing a door". For negation, we consider queries with negators such as "not" and "none", which allow users to specify what they do not want. For multimodal nuance, we consider the task of composed retrieval, where the query comprises a video along with a text edit instruction. The goal is to develop a unified embedding model that handles such nuances effectively. To that end, we repurpose a Multimodal Large Language Model (MLLM), originally trained to generate text, into an embedding model. We fine-tune it with a contrastive loss on text alone, using carefully sampled hard negatives that instill the desired nuances in the learned embedding space. Despite the text-only training, our method achieves state-of-the-art performance on all benchmarks for nuanced video retrieval. We also analyze how this improvement is achieved, and show that text-only training reduces the modality gap between text and video embeddings, leading to better organization of the embedding space.
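The contrastive objective with hard negatives described above can be sketched as follows. This is a minimal, numpy-only illustration, not the authors' implementation: the function name, the per-query formulation, and the temperature value are all hypothetical. The key idea it demonstrates is that nuance-flipped captions (e.g. a temporally reversed or negated variant of the positive caption) are injected as in-batch hard negatives, so the loss pushes their embeddings away from the query.

```python
import numpy as np

def info_nce_with_hard_negatives(query, positive, hard_negatives, temperature=0.07):
    """InfoNCE-style contrastive loss for a single query (hypothetical sketch).

    query:          (d,) embedding of the anchor caption
    positive:       (d,) embedding of the matching caption
    hard_negatives: (k, d) embeddings of nuance-flipped captions,
                    e.g. "closing a door" as a negative for "opening a door"
    """
    def l2_normalize(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    q = l2_normalize(query)
    p = l2_normalize(positive)
    n = l2_normalize(hard_negatives)

    # Cosine similarities, scaled by temperature; positive sits at index 0.
    logits = np.concatenate([[q @ p], n @ q]) / temperature

    # Numerically stable cross-entropy against the positive at index 0.
    m = logits.max()
    return -(logits[0] - m) + np.log(np.exp(logits - m).sum())
```

When the positive caption is closer to the query than the hard negatives, the loss is near zero; when a nuance-flipped negative outranks the positive, the loss grows, which is exactly the pressure that separates temporally or semantically opposite captions in the embedding space.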