Long-duration streaming video understanding is fundamental for future AI agents, yet it remains limited by ineffective long-term memory. We introduce video-SALMONN S, a memory-enhanced streaming audio-visual large language model that processes videos of more than 3 hours at 1 FPS and 360p resolution, outperforming strong non-streaming models under the same memory budget. Beyond token merging and downsampling, video-SALMONN S is the first model to employ test-time training (TTT) as a streaming memory mechanism for video understanding. TTT continuously transforms short-term multimodal representations into long-term memory embedded in the model parameters. To improve long-range dependency modeling and memory capacity, we propose (i) a TTT_MEM layer with an additional long-span prediction objective, (ii) a two-stage training scheme, and (iii) a modality-aware memory reader. We further introduce the Episodic Learning from Video Memory (ELViM) benchmark, which simulates agent-like scenarios where models must learn from videos observed hours earlier. video-SALMONN S consistently outperforms both streaming and non-streaming baselines by 3-7% on long-video benchmarks. Notably, video-SALMONN S achieves a 15% absolute accuracy improvement over strong non-streaming models on ELViM, demonstrating a strong ability to learn from video memory.
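To make the memory mechanism concrete, the sketch below illustrates the generic TTT idea that the abstract describes: the long-term memory is the weight matrix of a small inner model, each incoming chunk of short-term tokens is written into it by one gradient step on a self-supervised key-value prediction loss, and recall is a forward pass of the inner model on query tokens. This is a minimal PyTorch sketch of TTT-style memory in general; the class and method names (`TTTMemorySketch`, `write`, `read`) and the plain linear inner model are illustrative assumptions, not the paper's TTT_MEM layer, which additionally uses a long-span prediction objective and a modality-aware memory reader.

```python
import torch
import torch.nn as nn


class TTTMemorySketch(nn.Module):
    """Illustrative test-time-training (TTT) memory layer.

    The long-term memory is the weight matrix ``W`` of a small inner
    model. For every incoming chunk of short-term tokens, one gradient
    step on a self-supervised prediction loss writes the chunk into
    ``W``; reading is a forward pass of the inner model on a query.
    """

    def __init__(self, dim: int, lr: float = 0.1):
        super().__init__()
        # Outer (slow) parameters: learned projections trained offline.
        self.to_key = nn.Linear(dim, dim, bias=False)
        self.to_value = nn.Linear(dim, dim, bias=False)
        self.to_query = nn.Linear(dim, dim, bias=False)
        self.lr = lr

    def init_memory(self, dim: int, device="cpu") -> torch.Tensor:
        # Inner (fast) weights: the memory itself, reset per stream.
        return torch.zeros(dim, dim, device=device)

    def write(self, W: torch.Tensor, chunk: torch.Tensor) -> torch.Tensor:
        """One TTT update: fit the inner model k @ W ~ v on this chunk."""
        k, v = self.to_key(chunk), self.to_value(chunk)
        pred = k @ W
        # Gradient of the mean 0.5 * ||k @ W - v||^2 loss over the chunk.
        grad = k.transpose(-2, -1) @ (pred - v) / k.shape[-2]
        return W - self.lr * grad

    def read(self, W: torch.Tensor, query: torch.Tensor) -> torch.Tensor:
        """Recall from memory: apply the inner model to query tokens."""
        return self.to_query(query) @ W


# Streaming usage: constant-size memory regardless of stream length.
layer = TTTMemorySketch(dim=64)
W = layer.init_memory(64)
for chunk in torch.randn(10, 32, 64):  # 10 chunks of 32 tokens each
    W = layer.write(W, chunk)          # compress chunk into parameters
answer = layer.read(W, torch.randn(4, 64))
```

Note that the memory footprint here is a fixed `dim x dim` matrix no matter how long the stream runs, which is what lets a TTT-style memory scale to multi-hour video where a growing key-value cache would not.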