Multimodal Language Models (MLLMs) have gained significant traction for their ability to process diverse input modalities and generate coherent, contextually relevant outputs across a wide range of applications. While supervised fine-tuning (SFT) has been the predominant approach for task-specific optimization of MLLMs, it often falls short in fostering crucial generalized reasoning abilities. Reinforcement learning (RL) has the potential to address these limitations, but it faces two issues: (1) its generalization capabilities in multimodal tasks remain underexplored, and (2) its training constraints, such as a constant Kullback-Leibler penalty or a clamp strategy, easily lead to suboptimal bottlenecks. To address these issues, we introduce OThink-MR1, a framework that extends RL to MLLMs, enabling them to achieve deeper understanding and reasoning across multimodal tasks. We design a dynamic Kullback-Leibler strategy that significantly enhances RL performance, surpassing SFT in same-task evaluations. We are also the first to reveal that RL exhibits remarkable cross-task generalization: models post-trained with RL on one multimodal task can be effectively transferred to other tasks. Finally, extensive experiments demonstrate the strong reasoning ability of our proposed OThink-MR1.
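To make the contrast with a constant penalty concrete, the sketch below shows one way a dynamic Kullback-Leibler coefficient could be scheduled during RL post-training. This is a minimal illustration assuming a simple linear annealing schedule; the function names, the decay form, and the coefficient range are hypothetical and are not taken from the paper's actual formulation.

```python
# Illustrative sketch (hypothetical, not the paper's exact method):
# instead of a constant KL coefficient beta, anneal it over training
# so exploration is less constrained early and more constrained late.

def dynamic_kl_weight(step: int, total_steps: int,
                      beta_max: float = 0.1, beta_min: float = 0.01) -> float:
    """Linearly anneal the KL coefficient from beta_max down to beta_min."""
    frac = min(step / max(total_steps, 1), 1.0)
    return beta_max + (beta_min - beta_max) * frac

def rl_objective(reward: float, kl_div: float,
                 step: int, total_steps: int) -> float:
    """Per-step RL objective: task reward minus a dynamically weighted
    KL divergence to the reference policy (assumed scalar inputs)."""
    beta = dynamic_kl_weight(step, total_steps)
    return reward - beta * kl_div
```

With a constant coefficient, the same `beta` applies at every step; here the penalty starts at `beta_max` and relaxes to `beta_min`, which is one plausible way to avoid the suboptimal bottleneck the abstract attributes to fixed constraints.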