Reasoning models have made remarkable progress in text-based domains, but transferring those capabilities to multimodal settings, e.g., reasoning over audio-visual data, remains a challenge, in part because high-quality reasoning data for specific modality combinations is scarce. To address this problem, we introduce AVRT, a novel framework that generates high-quality audio-visual reasoning traces from single-modality teacher models. We generate independent vision and audio reasoning traces with models specialized to reason over their respective modalities, then merge the resulting traces with an LLM merger model. The merged multimodal traces are first used in a supervised fine-tuning (SFT) cold start to adapt the target model to audio-visual reasoning; the model is then trained in a second, reinforcement learning stage on larger-scale data. Evaluated on seven audio-visual and audio benchmarks, our 3B and 7B parameter models achieve state-of-the-art results among models of comparable size, including on OmniBench and DailyOmni for audio-visual reasoning and on MMAR for audio-only reasoning. These results show that cross-modal training also transfers to single-modality tasks, and they establish a new training pipeline for multimodal reasoning models.
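The trace-generation step described in the abstract (two single-modality teachers followed by an LLM merger) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the names `Sample`, `SFTExample`, `build_cold_start_set`, and the stub teachers are hypothetical, the teachers and merger are plain callables standing in for real models, and the presence of a ground-truth answer per sample is assumed rather than stated in the abstract. It is not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical type aliases: real teachers would be model calls, not plain functions.
Teacher = Callable[[str, str], str]      # (media_path, question) -> reasoning trace
Merger = Callable[[str, str, str], str]  # (question, vision_trace, audio_trace) -> merged trace


@dataclass
class Sample:
    video_path: str
    audio_path: str
    question: str
    answer: str  # assumed to be available; not specified in the abstract


@dataclass
class SFTExample:
    question: str
    reasoning: str
    answer: str


def build_cold_start_set(
    samples: List[Sample],
    vision_teacher: Teacher,
    audio_teacher: Teacher,
    merger: Merger,
) -> List[SFTExample]:
    """Generate one merged audio-visual trace per sample for an SFT cold start."""
    examples = []
    for s in samples:
        # Each teacher reasons over its own modality independently.
        vision_trace = vision_teacher(s.video_path, s.question)
        audio_trace = audio_teacher(s.audio_path, s.question)
        # The merger combines both single-modality traces into one trace.
        merged = merger(s.question, vision_trace, audio_trace)
        examples.append(SFTExample(s.question, merged, s.answer))
    return examples


# Stub components so the sketch runs without any model; purely illustrative.
def stub_vision_teacher(path: str, question: str) -> str:
    return f"[vision:{path}] a person opens a door"


def stub_audio_teacher(path: str, question: str) -> str:
    return f"[audio:{path}] a doorbell rings"


def stub_merger(question: str, vision_trace: str, audio_trace: str) -> str:
    return f"{vision_trace} | {audio_trace}"


demo = build_cold_start_set(
    [Sample("clip.mp4", "clip.wav", "Why does the person open the door?", "The doorbell rang")],
    stub_vision_teacher,
    stub_audio_teacher,
    stub_merger,
)
```

In this sketch the resulting `SFTExample` list would feed the SFT cold start; the subsequent reinforcement learning stage is outside its scope.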