Omni-modal reasoning is essential for intelligent systems to understand and draw inferences from diverse data sources. While existing omni-modal large language models (OLLMs) excel at perceiving diverse modalities, they lack the complex reasoning abilities of recent large reasoning models (LRMs). However, enhancing the reasoning ability of OLLMs through additional training presents significant challenges, including the need for high-quality data, task-specific adaptation, and substantial computational cost. To address these limitations, we propose ThinkOmni, a training-free and data-free framework that lifts textual reasoning to omni-modal scenarios. ThinkOmni introduces two key components: 1) LRM-as-a-Guide, which leverages off-the-shelf LRMs to guide the OLLM decoding process; and 2) Stepwise Contrastive Scaling, which adaptively balances perception and reasoning signals without manual hyperparameter tuning. Experiments on six multi-modal reasoning benchmarks show that ThinkOmni delivers consistent performance improvements, reaching 70.2 on MathVista and 75.5 on MMAU in its main results. Overall, ThinkOmni offers a flexible and generalizable solution for omni-modal reasoning and provides new insights into the generalization and application of reasoning capabilities.
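To make the decoding-time guidance concrete, the sketch below illustrates one plausible reading of LRM-as-a-Guide combined with Stepwise Contrastive Scaling: at each decoding step the OLLM's next-token distribution is fused with an off-the-shelf LRM's distribution, with an adaptive weight driven by how much the two distributions disagree. All function and variable names (guided_step, contrastive_scale, ollm_logits, lrm_logits) are hypothetical placeholders, and the disagreement measure is an assumption for illustration, not the paper's actual formulation.

```python
# Hypothetical sketch of LRM-guided decoding with stepwise contrastive scaling.
# The fusion rule and the KL-based adaptive weight are illustrative assumptions,
# not the method described in the paper.
import numpy as np


def softmax(x):
    # Numerically stable softmax over the vocabulary dimension.
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()


def contrastive_scale(p_ollm, p_lrm, eps=1e-9):
    # Adaptive per-step weight: when the two distributions disagree strongly
    # (large KL divergence), lean more on the LRM's reasoning signal; when
    # they agree, trust the OLLM's perception-grounded distribution.
    kl = float(np.sum(p_lrm * np.log((p_lrm + eps) / (p_ollm + eps))))
    return float(np.clip(kl / (1.0 + kl), 0.0, 1.0))


def guided_step(ollm_logits, lrm_logits):
    # Fuse the two next-token distributions with the adaptive scale alpha,
    # then pick the next token greedily for this toy example.
    p_ollm = softmax(ollm_logits)
    p_lrm = softmax(lrm_logits)
    alpha = contrastive_scale(p_ollm, p_lrm)
    p_mix = (1.0 - alpha) * p_ollm + alpha * p_lrm
    return int(np.argmax(p_mix)), alpha


# Toy vocabulary of size 5: the two models prefer different tokens.
ollm_logits = np.array([2.0, 0.5, 0.1, -1.0, 0.0])
lrm_logits = np.array([0.1, 2.5, 0.0, -0.5, 0.3])
token, alpha = guided_step(ollm_logits, lrm_logits)
print(f"next token id: {token}, adaptive scale alpha: {alpha:.3f}")
```

The key design point this sketch highlights is that the balance between perception and reasoning is recomputed at every decoding step rather than fixed in advance, which is what makes the approach free of manual hyperparameter tuning.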