Existing text-video retrieval solutions are, in essence, discriminative models that maximize the conditional likelihood p(candidates|query). While straightforward, this de facto paradigm overlooks the underlying data distribution p(query), which makes it difficult to identify out-of-distribution data. To address this limitation, we tackle the task from a generative viewpoint and model the correlation between text and video as their joint probability p(candidates,query). We do so with a diffusion-based text-video retrieval framework (DiffusionRet), which casts retrieval as the gradual generation of the joint distribution from noise. During training, DiffusionRet is optimized from both the generation and discrimination perspectives: the generator is optimized with a generation loss, and the feature extractor is trained with a contrastive loss. In this way, DiffusionRet leverages the strengths of both generative and discriminative methods. Extensive experiments on five commonly used text-video retrieval benchmarks (MSRVTT, LSMDC, MSVD, ActivityNet Captions, and DiDeMo) show superior performance and support the efficacy of our method. Moreover, without any modification, DiffusionRet also performs well in out-of-domain retrieval settings. We believe this work brings fundamental insights into the related fields. Code will be available at https://github.com/jpthu17/DiffusionRet.
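To make the discriminative half of the training objective concrete, the following is a minimal sketch of a symmetric contrastive loss over a batch of paired text and video embeddings. It is not taken from the DiffusionRet code: the function names, the use of NumPy, cosine similarity, and the temperature value of 0.05 are all assumptions chosen for illustration, and the abstract does not specify the exact form of the contrastive loss or of the generation loss.

```python
import numpy as np


def log_softmax(x, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = x - x.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_contrastive_loss(text_emb, video_emb, temperature=0.05):
    """Hypothetical symmetric contrastive loss (illustrative only).

    text_emb, video_emb: arrays of shape (batch, dim); row i of each is
    assumed to be a matched text-video pair. The temperature is an
    assumed hyperparameter, not a value reported in the abstract.
    """
    # L2-normalize so the dot product equals cosine similarity.
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)

    # sim[i, j] is the scaled similarity between text i and video j.
    sim = (t @ v.T) / temperature
    idx = np.arange(sim.shape[0])

    # Text-to-video: each row should put its mass on the diagonal entry.
    t2v = -log_softmax(sim, axis=1)[idx, idx].mean()
    # Video-to-text: each column should put its mass on the diagonal entry.
    v2t = -log_softmax(sim, axis=0)[idx, idx].mean()
    return 0.5 * (t2v + v2t)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    text = rng.normal(size=(4, 16))
    video = text + 0.01 * rng.normal(size=(4, 16))  # nearly aligned pairs
    print("aligned pairs:   ", symmetric_contrastive_loss(text, video))
    print("misaligned pairs:", symmetric_contrastive_loss(text, np.roll(video, 1, axis=0)))
```

In this sketch, correctly paired batches should yield a much smaller loss than batches whose videos have been shifted out of alignment, which is the signal a feature extractor trained this way would learn from.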