AgenticVBench: Can AI Agents Complete Real-World Post-Production Tasks?

Video production workflows offer a rich and demanding arena for evaluating multimodal AI agents: they require composite capabilities across text, image, audio, and video understanding, along with long-horizon planning and tool use. To this end, we introduce AgenticVBench, a benchmark of 100 agentic tasks across 4 task families spanning the real-world post-production workflow, constructed from production workflows contributed by 20 industry experts averaging 6 years of professional experience. Each task is paired with an evaluation specification that combines programmatic verifiers and expert rubrics. We evaluate frontier vision-language models (VLMs) with both vendor-native and open-source harnesses. The best evaluated agent stack barely crosses 30%, far below human expert performance on the same tasks. We further find that the choice of harness substantially affects model behavior, including scores, tool-use patterns, and failure modes. AgenticVBench provides a foundation for diagnosing and improving both models and harnesses for agentic video production. Benchmark website: https://agenticvbench.com.

