Multimodal large language models (MLLMs) have shown remarkable capabilities in multimodal perception and understanding tasks. However, their effectiveness in specialized domains, such as remote sensing and medical imaging, remains limited. A natural approach to domain adaptation is to inject domain knowledge through textual instructions, prompts, or auxiliary captions. Surprisingly, we find that such input-level domain knowledge injection yields little to no improvement on scientific multimodal tasks, even when the domain knowledge is stated explicitly. This observation suggests that current MLLMs fail to internalize domain-specific priors through language alone, and that domain knowledge must instead be integrated at the optimization level. Motivated by this insight, we propose a reinforcement fine-tuning framework that incorporates domain knowledge directly into the learning objective. Rather than treating domain knowledge as descriptive input, we encode it as domain-informed constraints and reward signals that shape the model's behavior in the output space. Extensive experiments across multiple datasets in the remote sensing and medical domains consistently demonstrate substantial performance gains, achieving state-of-the-art results on multimodal domain tasks. Our results highlight the necessity of optimization-level domain knowledge integration and reveal a fundamental limitation of textual domain conditioning in current MLLMs.