Although Multimodal Large Language Models (MLLMs) have demonstrated proficiency in video captioning, practical applications require captions that follow specific user instructions rather than exhaustive, unconstrained descriptions. Current benchmarks, however, primarily assess descriptive comprehensiveness while largely overlooking instruction-following capabilities. To address this gap, we introduce IF-VidCap, a new benchmark for evaluating controllable video captioning, which contains 1,400 high-quality samples. Distinct from existing video captioning or general instruction-following benchmarks, IF-VidCap incorporates a systematic framework that assesses captions along two dimensions: format correctness and content correctness. Our comprehensive evaluation of over 20 prominent models reveals a nuanced landscape: despite the continued dominance of proprietary models, the performance gap is closing, with top-tier open-source solutions now achieving near-parity. Furthermore, we find that models specialized for dense captioning underperform general-purpose MLLMs on complex instructions, indicating that future work should simultaneously advance both descriptive richness and instruction-following fidelity.
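As a rough illustration only (the benchmark's actual scoring protocol is defined in the paper, not here), the sketch below shows one way per-sample judgements along the two dimensions, format correctness and content correctness, might be aggregated into a single score. All names, the check-counting scheme, and the equal weighting are assumptions for exposition.

```python
# Hypothetical sketch, not the authors' released evaluation code:
# aggregate the two IF-VidCap dimensions into one per-sample score.
from dataclasses import dataclass

@dataclass
class CaptionJudgement:
    format_checks_passed: int    # e.g., required structure, length, ordering constraints
    format_checks_total: int
    content_checks_passed: int   # e.g., events, objects, attributes correctly described
    content_checks_total: int

def sample_score(j: CaptionJudgement, w_format: float = 0.5) -> float:
    """Weighted average of format and content correctness; the weight is illustrative."""
    fmt = j.format_checks_passed / max(j.format_checks_total, 1)
    content = j.content_checks_passed / max(j.content_checks_total, 1)
    return w_format * fmt + (1.0 - w_format) * content

# Example: a caption satisfying 3/4 format constraints and 5/6 content checks.
print(round(sample_score(CaptionJudgement(3, 4, 5, 6)), 3))  # 0.792
```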