Reinforcement learning for multimodal large language models (MLLMs) is often hindered by severe reward sparsity in complex reasoning tasks. The problem is especially pronounced in human-centered scenarios involving states, emotions, intentions, and behaviors, where heterogeneous multimodal signals and subjective human factors make high-quality chain-of-thought (CoT) annotations expensive and difficult to obtain. Although many multimodal datasets provide expert-annotated ground-truth labels, using these labels directly for supervised fine-tuning may encourage shortcut learning in multimodal perception and offers limited transparency for safety-critical human--AI interaction. To address these limitations, we propose OmniOPSD, a Rationale-Privileged On-Policy Self-Distillation framework. OmniOPSD uses frontier-generated, evidence-aware rationales only as training-time privileged context for a local teacher, never as imitation targets for the student. The student samples its own rollout from the original multimodal input, and the rationale-privileged teacher scores the same tokens, providing dense token-level supervision. The student therefore learns on its own trajectory distribution without directly imitating frontier-model completions, and inference requires no labels, rationales, CoT annotations, or closed-source model access. Experiments on MER-UniBench show that OmniOPSD achieves state-of-the-art performance with an average score of $84.19$, and ablations further support the value of rationale-privileged teacher guidance.
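To make the token-level supervision concrete, the following is a minimal sketch of how a student-sampled rollout might be scored by both the student and a rationale-privileged teacher. The abstract does not state the training objective, so the per-token reverse KL divergence used here is an assumption, as are all function and variable names. The random logits stand in for real model outputs: in practice the teacher logits would come from the local teacher conditioned on the input plus the rationale, evaluated on the student's own tokens.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the last (vocabulary) axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def token_level_distillation_signal(student_logits, teacher_logits, rollout_tokens):
    """Score one student-sampled rollout under the student and the teacher.

    student_logits: (T, V) logits from the student given the original input.
    teacher_logits: (T, V) logits from the teacher given input + rationale,
                    evaluated on the same rollout prefix (hypothetical setup).
    rollout_tokens: (T,) token ids sampled by the student.

    Returns the per-token reverse KL(student || teacher) and the per-token
    log-probability gap (teacher minus student) on the sampled tokens.
    The choice of reverse KL is an assumption, not the paper's stated loss.
    """
    s_logp = log_softmax(student_logits)
    t_logp = log_softmax(teacher_logits)
    reverse_kl = (np.exp(s_logp) * (s_logp - t_logp)).sum(axis=-1)
    idx = np.arange(len(rollout_tokens))
    log_gap = t_logp[idx, rollout_tokens] - s_logp[idx, rollout_tokens]
    return reverse_kl, log_gap


# Toy stand-ins for model outputs: T rollout positions, vocabulary size V.
rng = np.random.default_rng(0)
T, V = 5, 8
student_logits = rng.normal(size=(T, V))
teacher_logits = student_logits + rng.normal(scale=0.5, size=(T, V))

# On-policy step: the rollout is sampled from the student's own distribution.
student_probs = np.exp(log_softmax(student_logits))
rollout = np.array([rng.choice(V, p=p) for p in student_probs])

reverse_kl, log_gap = token_level_distillation_signal(
    student_logits, teacher_logits, rollout
)
loss = reverse_kl.mean()  # dense signal: one term per token, not one per rollout
```

Every position in the rollout contributes its own term to the loss, in contrast with a single sparse reward per trajectory. The teacher never supplies tokens to imitate; it only scores the tokens the student already produced.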