Video production workflows offer a rich and demanding arena for evaluating multimodal AI agents: they require composite capabilities across text, image, audio, and video understanding, along with long-horizon planning and tool use. To this end, we introduce AgenticVBench, a benchmark of 100 agentic tasks across 4 task families spanning the real-world post-production workflow, constructed from production workflows contributed by 20 industry experts averaging 6 years of professional experience. Each task is paired with an evaluation specification that combines programmatic verifiers and expert rubrics. We evaluate frontier vision-language models (VLMs) with both vendor-native and open-source harnesses. The best evaluated agent stack scores barely above 30%, far below human expert performance on the same tasks. We further find that the choice of harness substantially affects model behavior, including scores, tool-use patterns, and failure modes. AgenticVBench provides a foundation for diagnosing and improving both models and harnesses for agentic video production. Benchmark website: https://agenticvbench.com.