Reinforced Curriculum Pre-Alignment for Domain-Adaptive VLMs

Vision-Language Models (VLMs) demonstrate remarkable general-purpose capabilities but often fall short in specialized domains such as medical imaging or geometric problem-solving. Supervised Fine-Tuning (SFT) can improve performance within a target domain, but it typically causes catastrophic forgetting, which limits the model's generalization beyond that domain. The central challenge, therefore, is to adapt VLMs to new domains while preserving their general-purpose capabilities. Continual pretraining is effective for expanding knowledge in Large Language Models (LLMs), but it is less feasible for VLMs because of prohibitive computational costs and the unavailability of pretraining data for most open-source models. These constraints call for efficient post-training adaptation methods. Reinforcement learning (RL)-based approaches such as Group Relative Policy Optimization (GRPO) have shown promise in preserving general abilities, yet they often fail in domain adaptation scenarios where the model initially lacks sufficient domain knowledge, which leads to optimization collapse. To bridge this gap, we propose Reinforced Curriculum Pre-Alignment (RCPA), a novel post-training paradigm that introduces a curriculum-aware progressive modulation mechanism. In the early phase, RCPA applies partial output constraints to safely expose the model to new domain concepts. As the model's domain familiarity increases, training gradually transitions to full generation optimization, refining responses and aligning them with domain-specific preferences. This staged adaptation balances domain knowledge acquisition with the preservation of general multimodal capabilities. Extensive experiments across specialized domains and general benchmarks validate the effectiveness of RCPA, establishing a practical pathway toward building high-performing, domain-adaptive VLMs.
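
To make the two mechanisms named in the abstract more concrete, the following is a minimal Python sketch, not the authors' implementation. The function group_relative_advantages follows the standard GRPO idea of standardizing rewards within a group of sampled responses; it shows one plausible reading of the failure mode, namely that when every sampled response is wrong the advantages are all zero and the update carries no learning signal. The functions free_generation_fraction and split_reference are hypothetical: the abstract does not specify the form of the partial output constraints or the schedule, so the linear ramp, the parameters start and warmup_frac, and the choice to constrain a reference prefix are all assumptions made purely for illustration.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize each reward within its group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def free_generation_fraction(step, total_steps, start=0.25, warmup_frac=0.5):
    """Hypothetical curriculum schedule (an assumption, not from the paper).

    Returns the fraction of the response the model must generate itself.
    It ramps linearly from `start` to 1.0 over the first `warmup_frac`
    of training, then stays at 1.0 (full generation).
    """
    warmup_steps = max(1, int(total_steps * warmup_frac))
    progress = min(1.0, step / warmup_steps)
    return start + (1.0 - start) * progress


def split_reference(reference_tokens, free_fraction):
    """Hypothetical partial output constraint (an assumption).

    Keeps the leading part of a reference answer as a fixed prefix and
    returns it with the number of tokens left for the model to generate.
    """
    n_free = round(len(reference_tokens) * free_fraction)
    n_fixed = len(reference_tokens) - n_free
    return reference_tokens[:n_fixed], n_free


if __name__ == "__main__":
    # All responses wrong: every advantage is zero, so no learning signal.
    print(group_relative_advantages([0.0, 0.0, 0.0, 0.0]))
    # One correct response: it receives a positive advantage.
    print(group_relative_advantages([1.0, 0.0, 0.0, 0.0]))
    # Early in training most of the answer is constrained; later none is.
    for step in (0, 25, 50, 100):
        frac = free_generation_fraction(step, total_steps=100)
        prefix, n_free = split_reference(list("abcdefgh"), frac)
        print(step, frac, "".join(prefix), n_free)
```

Under these assumptions, early steps leave only a short tail of the response to the policy, which makes non-zero rewards more likely within each group, and the constraint disappears entirely once the ramp reaches 1.0.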
