Proact-VL: A Proactive VideoLLM for Real-Time AI Companions

Proactive and real-time interactive experiences are essential for human-like AI companions, yet face three key challenges: (1) achieving low-latency inference under continuous streaming inputs, (2) autonomously deciding when to respond, and (3) controlling both quality and quantity of generated content to meet real-time constraints. In this work, we instantiate AI companions through two gaming scenarios, commentator and guide, selected for their suitability for automatic evaluation. We introduce the Live Gaming Benchmark, a large-scale dataset with three representative scenarios: solo commentary, co-commentary, and user guidance, and present Proact-VL, a general framework that shapes multimodal language models into proactive, real-time interactive agents capable of human-like environment perception and interaction. Extensive experiments show Proact-VL achieves superior response latency and quality while maintaining strong video understanding capabilities, demonstrating its practicality for real-time interactive applications.

