IF-VidCap：视频描述模型能否遵循指令？ (IF-VidCap: Can Video Caption Models Follow Instructions?)

Shihao Li,Yuanxing Zhang,Jiangtao Wu,Zhide Lei,Yiwen He,Runzhe Wen,Chenxi Liao,Chengkang Jiang,An Ping,Shuo Gao,Suhan Wang,Zhaozhou Bian,Zijun Zhou,Jingyi Xie,Jiayi Zhou,Jing Wang,Yifan Yao,Weihao Xie,Yingshui Tan,Yanghai Wang,Qianqian Xie,Zhaoxiang Zhang,Jiaheng Liu

from arxiv, https://github.com/NJU-LINK/IF-VidCap

Although Multimodal Large Language Models (MLLMs) have demonstrated proficiency in video captioning, practical applications require captions that follow specific user instructions rather than generating exhaustive, unconstrained descriptions. Current benchmarks, however, primarily assess descriptive comprehensiveness while largely overlooking instruction-following capabilities. To address this gap, we introduce IF-VidCap, a new benchmark for evaluating controllable video captioning, which contains 1,400 high-quality samples. Distinct from existing video captioning or general instruction-following benchmarks, IF-VidCap incorporates a systematic framework that assesses captions on two dimensions: format correctness and content correctness. Our comprehensive evaluation of over 20 prominent models reveals a nuanced landscape: despite the continued dominance of proprietary models, the performance gap is closing, with top-tier open-source solutions now achieving near-parity. Furthermore, we find that models specialized for dense captioning underperform general-purpose MLLMs on complex instructions, indicating that future work should simultaneously advance both descriptive richness and instruction-following fidelity.

翻译：尽管多模态大语言模型（MLLMs）在视频描述任务中已展现出卓越能力，但实际应用通常需要生成遵循特定用户指令的描述，而非详尽且无约束的概述。然而，当前基准测试主要评估描述的全面性，很大程度上忽视了模型的指令遵循能力。为填补这一空白，我们提出了IF-VidCap——一个用于评估可控视频描述的新基准，其中包含1,400个高质量样本。与现有的视频描述或通用指令遵循基准不同，IF-VidCap采用了一个系统化评估框架，从两个维度对描述进行评判：格式正确性与内容正确性。我们对超过20个主流模型进行的全面评估揭示了一个微妙的现状：尽管专有模型仍保持主导地位，但性能差距正在缩小，顶尖开源解决方案现已达到近乎相当的水平。此外，我们发现专为密集描述设计的模型在处理复杂指令时表现不及通用MLLMs，这表明未来的工作应同时推进描述丰富度与指令遵循忠实度的发展。

相关内容

MoDELS

关注 45

ACM/IEEE第23届模型驱动工程语言和系统国际会议，是模型驱动软件和系统工程的首要会议系列，由ACM-SIGSOFT和IEEE-TCSE支持组织。自1998年以来，模型涵盖了建模的各个方面，从语言和方法到工具和应用程序。模特的参加者来自不同的背景，包括研究人员、学者、工程师和工业专业人士。MODELS 2019是一个论坛，参与者可以围绕建模和模型驱动的软件和系统交流前沿研究成果和创新实践经验。今年的版本将为建模社区提供进一步推进建模基础的机会，并在网络物理系统、嵌入式系统、社会技术系统、云计算、大数据、机器学习、安全、开源等新兴领域提出建模的创新应用以及可持续性。官网链接：http://www.modelsconference.org/

Linux导论，Introduction to Linux，96页ppt

专知会员服务

82+阅读 · 2020年7月26日

【CHI2020-微软】解释可解释性:理解数据科学家使用机器学习的可解释性工具，Interpreting Interpretability: Understanding Data Scientists’Use of Interpretability Tools for Machine Learning

专知会员服务

55+阅读 · 2020年3月8日

FlowQA: Grasping Flow in History for Conversational Machine Comprehension

专知会员服务

34+阅读 · 2019年10月18日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

50+阅读 · 2019年10月17日