Proactive, real-time interactive experiences are essential for human-like AI companions, yet they face three key challenges: (1) achieving low-latency inference under continuous streaming inputs, (2) autonomously deciding when to respond, and (3) controlling both the quality and quantity of generated content to meet real-time constraints. In this work, we instantiate AI companions through two gaming scenarios, commentator and guide, selected for their suitability for automatic evaluation. We introduce the Live Gaming Benchmark, a large-scale dataset covering three representative scenarios (solo commentary, co-commentary, and user guidance), and present Proact-VL, a general framework that shapes multimodal language models into proactive, real-time interactive agents capable of human-like environment perception and interaction. Extensive experiments show that Proact-VL achieves superior response latency and quality while maintaining strong video understanding capabilities, demonstrating its practicality for real-time interactive applications.