Multimodal Large Language Models (MLLMs) have achieved strong performance across many tasks, yet most systems remain limited to offline inference, requiring the complete input before generating any output. Recent streaming methods reduce latency by interleaving perception and generation, but they still enforce a sequential perception-generation cycle that limits real-time interaction. In this work, we target a fundamental bottleneck in extending MLLMs to real-time video understanding: the global positional-continuity constraint imposed by standard positional encoding schemes. While natural in offline inference, this constraint tightly couples perception and generation, preventing effective input-output parallelism. To address this limitation, we propose a parallel streaming framework that relaxes positional continuity through three designs: Overlapped, Group-Decoupled, and Gap-Isolated. These designs enable simultaneous perception and generation, allowing the model to process incoming input while producing responses in real time. Extensive experiments show that Group-Decoupled achieves the best efficiency-performance balance, maintaining high fluency and accuracy while significantly reducing latency. We further show that the proposed framework yields up to 2x acceleration under balanced perception-generation workloads, establishing a principled pathway toward speak-while-watching real-time systems. Our code is publicly available at https://github.com/EIT-NLP/Speak-While-Watching.
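The core idea behind relaxing positional continuity can be illustrated with a minimal sketch. The snippet below is a hypothetical toy model of the Group-Decoupled scheme (not the authors' actual implementation): perception (video) tokens and generation (text) tokens draw position IDs from separate, independently advancing counters instead of one global sequence, so new frames can be assigned positions while a response is still being decoded. The function name and the offset of 10,000 for the generation group are illustrative assumptions.

```python
# Toy illustration of "Group-Decoupled" position IDs (hypothetical sketch,
# not the paper's actual code). In a standard offline MLLM, all tokens share
# one continuous position counter; here, perception and generation tokens
# each advance their own counter within disjoint ranges.

def assign_positions(events):
    """events: list of ("perceive", n_tokens) or ("generate", n_tokens)."""
    perception_pos = 0       # counter for incoming video tokens
    generation_pos = 10_000  # disjoint range for response tokens (assumed offset)
    positions = []
    for kind, n in events:
        if kind == "perceive":
            positions.append(list(range(perception_pos, perception_pos + n)))
            perception_pos += n
        else:
            positions.append(list(range(generation_pos, generation_pos + n)))
            generation_pos += n
    return positions

# Interleaved perception and generation no longer contend for one global
# counter, so neither side must wait for the other to fix its positions:
stream = [("perceive", 4), ("generate", 2), ("perceive", 4), ("generate", 2)]
print(assign_positions(stream))
# → [[0, 1, 2, 3], [10000, 10001], [4, 5, 6, 7], [10002, 10003]]
```

Because each group's positions depend only on its own history, a frame arriving mid-generation gets the same position IDs it would have received had generation not been running, which is what makes input-output parallelism possible.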