Video understanding, including video captioning and retrieval, remains a significant challenge for video-language models (VLMs). Existing video retrieval and captioning benchmarks include only short descriptions, which limits their ability to evaluate detailed video understanding. To address this problem, we present CaReBench, a testing benchmark for fine-grained video captioning and retrieval containing 1,000 high-quality videos paired with human-annotated detailed captions. Uniquely, it provides manually separated spatial and temporal annotations for each video. Based on this design, we introduce two evaluation metrics, ReBias and CapST, tailored to video retrieval and video captioning, respectively. These metrics enable a comprehensive investigation into the spatial and temporal biases inherent in VLMs. In addition, to handle both video retrieval and video captioning in a unified framework, we develop a simple baseline based on a Multimodal Large Language Model (MLLM). By applying a two-stage Supervised Fine-Tuning (SFT) strategy, we fully unlock the potential of the MLLM, enabling it not only to generate detailed video descriptions but also to extract video features. Surprisingly, experimental results demonstrate that, compared to CLIP-based models designed for retrieval and popular MLLMs skilled in video captioning, our baseline achieves competitive performance in both fine-grained video retrieval and detailed video captioning.