While humans perceive the world through diverse modalities that operate synergistically to support a holistic understanding of their surroundings, existing omni-video models still face substantial challenges in audio-visual understanding tasks. In this paper, we propose OmniVideo-R1, a novel reinforced framework that improves mixed-modality reasoning. OmniVideo-R1 enables models to "think with omnimodal cues" through two key strategies: (1) query-intensive grounding based on self-supervised learning paradigms; and (2) modality-attentive fusion built upon contrastive learning paradigms. Extensive experiments on multiple benchmarks demonstrate that OmniVideo-R1 consistently outperforms strong baselines, highlighting its effectiveness and robust generalization.
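The abstract does not specify how modality-attentive fusion or its contrastive objective are implemented. As a purely illustrative aid, the PyTorch sketch below shows one plausible form: a query-conditioned gate that fuses pooled audio and visual features, trained with an InfoNCE-style contrastive loss. All module names, tensor shapes, and the loss pairing (fused features against query embeddings) are assumptions for illustration, not the paper's actual method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ModalityAttentiveFusion(nn.Module):
    """Illustrative sketch: fuse audio and visual features with
    query-conditioned attention weights (an assumed design, not
    necessarily the paper's)."""

    def __init__(self, dim: int):
        super().__init__()
        # One scalar gate per modality, conditioned on the query embedding.
        self.gate = nn.Linear(dim, 2)

    def forward(self, query, audio, visual):
        # query, audio, visual: (batch, dim) pooled embeddings.
        weights = F.softmax(self.gate(query), dim=-1)  # (batch, 2)
        # Attention-weighted sum of the two modality streams.
        return weights[:, :1] * audio + weights[:, 1:] * visual


def info_nce(anchor, positive, temperature=0.07):
    """Symmetric InfoNCE loss: matching (anchor, positive) rows are
    positives; all other in-batch pairings serve as negatives."""
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    logits = a @ p.t() / temperature  # (batch, batch) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))


# Toy usage with random features: align fused omnimodal representations
# with text-query embeddings under the contrastive objective.
batch, dim = 8, 256
fusion = ModalityAttentiveFusion(dim)
query = torch.randn(batch, dim)   # hypothetical text-query embeddings
audio = torch.randn(batch, dim)   # hypothetical pooled audio features
visual = torch.randn(batch, dim)  # hypothetical pooled visual features

fused = fusion(query, audio, visual)
loss = info_nce(fused, query)
loss.backward()
```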