Video Large Language Models (VideoLLMs) have achieved strong performance on many video understanding tasks, but most existing systems operate offline and are not well suited to live video streams, which require continuous observation and timely response. Recent streaming VideoLLMs have made progress, yet current approaches often rely on decoupled trigger-response pipelines or are limited to captioning-style narration, which reduces their effectiveness for open-ended question answering and long-horizon interaction. We propose AURA (Always-On Understanding and Real-Time Assistance), an end-to-end streaming visual interaction framework that enables a unified VideoLLM to continuously process video streams and support both real-time question answering and proactive responses. AURA integrates context management, data construction, training objectives, and deployment optimization for stable long-horizon streaming interaction. It achieves state-of-the-art performance on streaming benchmarks and supports a real-time demo system, with ASR and TTS, that runs at 2 FPS on two 80 GB accelerators. We release the AURA model together with a real-time inference framework to facilitate future research.