As online video content grows rapidly, text-video retrieval (TVR) becomes increasingly important. A key challenge in TVR is the information asymmetry between video and text: videos are inherently richer in information, while their textual descriptions often capture only fragments of this complexity. This paper introduces a novel, data-centric framework that bridges this gap by enriching textual representations to better match the richness of video content. During training, videos are segmented into event-level clips and captioned to ensure comprehensive coverage. During retrieval, a large language model (LLM) generates semantically diverse queries to capture a broader range of possible matches. To enhance retrieval efficiency, we propose a query selection mechanism that identifies the most relevant and diverse queries, reducing computational cost while improving accuracy. Our method achieves state-of-the-art results across multiple benchmarks, demonstrating the effectiveness of data-centric approaches in addressing information asymmetry in TVR. This work paves the way for new research on leveraging data to improve cross-modal retrieval.
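The abstract does not specify how the query selection mechanism trades off relevance against diversity. One common way to realize such a selection is a greedy, maximal-marginal-relevance-style procedure over embedding similarities; the sketch below is a hypothetical illustration under that assumption (the function name, the `lam` trade-off parameter, and the use of cosine similarity over unit-normalized embeddings are all assumptions, not details from the paper).

```python
import numpy as np

def select_queries(query_embs, video_emb, k, lam=0.5):
    """Hypothetical MMR-style query selection (an assumption, not the
    paper's actual mechanism): greedily pick k queries, balancing
    relevance to the video against redundancy with already-chosen queries.

    query_embs: (n, d) array of unit-normalized query embeddings
    video_emb:  (d,)   unit-normalized video embedding
    lam:        trade-off in [0, 1]; higher favors relevance over diversity
    """
    relevance = query_embs @ video_emb  # cosine similarity to the video
    selected = []
    candidates = list(range(len(query_embs)))
    while candidates and len(selected) < k:
        if selected:
            # For each candidate, its max similarity to any selected query.
            redundancy = (query_embs[candidates] @ query_embs[selected].T).max(axis=1)
        else:
            redundancy = np.zeros(len(candidates))
        scores = lam * relevance[candidates] - (1 - lam) * redundancy
        best = candidates[int(np.argmax(scores))]
        selected.append(best)
        candidates.remove(best)
    return selected
```

With `lam=0.5`, a query identical to one already selected is penalized exactly as much as it is rewarded, so the procedure prefers a less redundant alternative, which is the intended relevance-diversity balance.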