Recent omni-multimodal large language models show promise in unified audio, vision, and text modeling. However, streaming audio-video understanding remains challenging, as existing approaches suffer from disjointed capabilities: they typically exhibit incomplete modality support or lack autonomous proactive monitoring. To address this, we present ROMA, a real-time omni-multimodal assistant for unified reactive and proactive interaction. ROMA processes continuous inputs as synchronized multimodal units, aligning dense audio with discrete video frames to handle the granularity mismatch between the two modalities. For online decision-making, we introduce a lightweight speak head that decouples response initiation from response generation, ensuring precise triggering without task conflict. We train ROMA on a curated streaming dataset with a two-stage curriculum that progressively optimizes for streaming-format adaptation and proactive responsiveness. To standardize the fragmented evaluation landscape, we reorganize diverse benchmarks into a unified suite covering both proactive (alert, narration) and reactive (QA) settings. Extensive experiments across 12 benchmarks demonstrate that ROMA achieves state-of-the-art performance on proactive tasks while remaining competitive in reactive settings, validating its robustness for unified real-time omni-multimodal understanding.
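To make the synchronized-unit idea concrete, the sketch below groups dense audio features with the video frame they co-occur with. This is a minimal illustration under stated assumptions, not the paper's implementation: the feature rates (25 Hz audio, 1 fps video), the frame-first token ordering, and the function name build_stream_units are all hypothetical.

```python
import torch

def build_stream_units(audio_feats, video_feats, audio_rate=25, video_rate=1):
    """Group dense audio embeddings with their co-timed video frame.

    audio_feats: (T_audio, D) tensor sampled at `audio_rate` Hz (assumed rate).
    video_feats: (T_video, D) tensor sampled at `video_rate` fps (assumed rate).
    Returns one unit per video frame: the frame token followed by the audio
    tokens falling within that frame's time span, plus the unit timestamp.
    """
    per_unit = audio_rate // video_rate  # audio tokens aligned to each frame
    units = []
    for i, frame in enumerate(video_feats):
        audio_slice = audio_feats[i * per_unit : (i + 1) * per_unit]
        unit = torch.cat([frame.unsqueeze(0), audio_slice], dim=0)
        units.append((unit, i / video_rate))
    return units

# Toy usage: a 3-second stream with D=8 features.
audio = torch.randn(75, 8)   # 25 Hz * 3 s
video = torch.randn(3, 8)    # 1 fps * 3 s
units = build_stream_units(audio, video)
assert units[0][0].shape == (26, 8)  # 1 frame token + 25 audio tokens
```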
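The speak head can likewise be pictured as a small binary classifier that reads the hidden state of each streaming unit and decides whether to start responding, separately from the language-model head that produces the response tokens. The module below is a hedged sketch under that reading; the two-layer MLP architecture, the hidden size, and the decision threshold are assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeakHead(nn.Module):
    """Hypothetical lightweight trigger head. It scores, per streaming unit,
    whether the assistant should start responding now; the LM's generation
    head is untouched, so triggering and decoding do not compete."""

    def __init__(self, hidden_dim: int, threshold: float = 0.5):
        super().__init__()
        self.scorer = nn.Sequential(          # assumed two-layer MLP
            nn.Linear(hidden_dim, hidden_dim // 4),
            nn.GELU(),
            nn.Linear(hidden_dim // 4, 1),
        )
        self.threshold = threshold

    def forward(self, last_hidden: torch.Tensor) -> torch.Tensor:
        # last_hidden: (B, D) hidden state of each unit's final token.
        return self.scorer(last_hidden).squeeze(-1)  # trigger logits, (B,)

    @torch.no_grad()
    def should_respond(self, last_hidden: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self(last_hidden)) > self.threshold

# Training signal: per-unit binary labels (respond now vs. stay silent).
head = SpeakHead(hidden_dim=4096)
hidden = torch.randn(2, 4096)
loss = F.binary_cross_entropy_with_logits(head(hidden), torch.tensor([1.0, 0.0]))
```

One consequence of this design, if it holds, is that only the small head is consulted at every unit, and the expensive decoder runs only after a positive trigger, which is what decoupling initiation from generation buys in a streaming setting.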