Long videos contain a vast amount of information, making video-text retrieval an essential and challenging task in multimodal learning. However, existing benchmarks suffer from limited video duration, low-quality captions, and coarse annotation granularity, which hinder the evaluation of advanced video-text retrieval methods. To address these limitations, we introduce LoVR, a benchmark specifically designed for long video-text retrieval. LoVR contains 467 long videos and 40,804 fine-grained clips with high-quality captions. To overcome the issue of poor machine-generated annotations, we propose an efficient caption generation framework that integrates automatic VLM generation, caption quality scoring, and dynamic refinement. This pipeline improves annotation accuracy while maintaining scalability. Furthermore, we introduce a semantic fusion method that generates coherent full-video captions without losing important contextual information. Our benchmark offers longer videos, more detailed captions, and a larger-scale dataset, presenting new challenges for video understanding and retrieval. Extensive experiments with various state-of-the-art embedding models demonstrate that LoVR is a challenging benchmark, revealing the limitations of current approaches and providing valuable insights for future research. We release the code and dataset at https://lovrbench.github.io/