Video-language models (VLMs) learn to reason about the dynamic visual world through natural language. We introduce a suite of open datasets, benchmarks, and recipes for scalable oversight that enable precise video captioning. First, we define a structured specification for describing subjects, scenes, motion, spatial, and camera dynamics, grounded by hundreds of carefully defined visual primitives developed with professional video creators such as filmmakers. Next, to curate high-quality captions, we introduce CHAI (Critique-based Human-AI Oversight), a framework where trained experts critique and revise model-generated pre-captions into improved post-captions. This division of labor improves annotation accuracy and efficiency by offloading text generation to models, allowing humans to better focus on verification. Additionally, these critiques and preferences between pre- and post-captions provide rich supervision for improving open-source models (Qwen3-VL) on caption generation, reward modeling, and critique generation through SFT, DPO, and inference-time scaling. Our ablations show that critique quality in precision, recall, and constructiveness, ensured by our oversight framework, directly governs downstream performance. With modest expert supervision, the resulting model outperforms closed-source models such as Gemini-3.1-Pro. Finally, we apply our approach to re-caption large-scale professional videos (e.g., films, commercials, games) and fine-tune video generation models such as Wan to better follow detailed prompts of up to 400 words, achieving finer control over cinematography including camera motion, angle, lens, focus, point of view, and framing. Our results show that precise specification and human-AI oversight are key to professional-level video understanding and generation. Data and code are available on our project page: https://linzhiqiu.github.io/papers/chai/
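The pre-/post-caption preference pairs described above can supervise a policy via DPO. A minimal sketch of the per-pair DPO objective is below, using only the standard library; the function name and the caption pair are hypothetical illustrations, not the paper's actual code.

```python
import math

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * margin), where the margin is
    the policy's log-prob advantage of the chosen over the rejected response,
    relative to a frozen reference model."""
    margin = ((policy_logp_chosen - ref_logp_chosen)
              - (policy_logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Hypothetical pair: the expert-revised post-caption is "chosen",
# the model-generated pre-caption is "rejected". Log-probs are
# sequence log-likelihoods under the policy and reference models.
loss = dpo_loss(policy_logp_chosen=-42.0, policy_logp_rejected=-58.0,
                ref_logp_chosen=-50.0, ref_logp_rejected=-55.0)
```

Minimizing this loss over many (pre-caption, post-caption) pairs pushes the policy to assign relatively higher likelihood to the expert-revised captions, without needing an explicit reward model.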