While recent advances in Reinforcement Fine-Tuning (RFT) have shown that rule-based reward schemes can enable effective post-training for large language models, their extension to cross-modal, vision-centric domains remains largely underexplored. This limitation is especially pronounced in the medical imaging domain, where effective performance requires both robust visual perception and structured reasoning. In this work, we address this gap by proposing VRFT-Aug, a visual reinforcement fine-tuning framework tailored for the medical domain. VRFT-Aug introduces a series of training strategies designed to augment both perception and reasoning, including prior knowledge injection, perception-driven policy refinement, medically informed reward shaping, and behavioral imitation. Together, these methods aim to stabilize and improve the RFT process. Through extensive experiments across multiple medical datasets, we show that our approaches consistently outperform both standard supervised fine-tuning and RFT baselines. Moreover, we provide empirically grounded insights and practical training heuristics that can be generalized to other medical image tasks. We hope this work contributes actionable guidance and fresh inspiration for the ongoing effort to develop reliable, reasoning-capable models for high-stakes medical applications.