Robust perception and reasoning require consistency across sensory modalities. Yet current multimodal models often violate this principle, yielding contradictory predictions for visual and textual representations of the same concept. Rather than masking these failures with standard voting mechanisms, which can amplify systematic biases, we show that cross-modal inconsistency provides a rich and natural signal for learning. We introduce RC2, a reinforcement learning framework that resolves internal conflicts by enforcing cross-modal cycle consistency. By requiring a model to perform backward inference, switch modalities, and reliably reconstruct the answer through forward inference, we obtain a dense, label-free reward. This cyclic constraint encourages the model to align its internal representations autonomously. Optimizing for this structure mitigates modality-specific errors and improves reasoning accuracy by up to 7.6 points. Our results suggest that advanced reasoning emerges not only from scaling data, but also from enforcing a structurally consistent understanding of the world.
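The cycle described above (backward inference, a modality switch, then forward inference that must reconstruct the original answer) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and is not taken from the RC2 implementation: the callables `backward`, `switch_modality`, and `forward` are hypothetical stand-ins for model calls, the toy functions replace a real multimodal model, and the binary exact-match reward is a simplification of whatever reward the framework actually uses.

```python
from typing import Callable


def normalize(answer: str) -> str:
    """Canonicalize an answer string before comparison (assumed convention)."""
    return answer.strip().lower()


def cycle_reward(
    answer: str,
    backward: Callable[[str], str],         # hypothetical: answer -> query that should yield it
    switch_modality: Callable[[str], str],  # hypothetical: query -> same query in the other modality
    forward: Callable[[str], str],          # hypothetical: query in the other modality -> answer
) -> float:
    """Return 1.0 if the answer survives the cross-modal cycle, else 0.0.

    No ground-truth label is used: the only supervision is agreement
    between the starting answer and the reconstructed one.
    """
    query = backward(answer)               # backward inference
    other_view = switch_modality(query)    # switch modalities
    reconstructed = forward(other_view)    # forward inference
    return 1.0 if normalize(reconstructed) == normalize(answer) else 0.0


# Toy stand-ins for a model, purely for demonstration.
def toy_backward(answer: str) -> str:
    return {"4": "2 + 2"}.get(answer, "unknown")


def toy_switch(query: str) -> str:
    # Pretend to render the textual query as an image.
    return f"<image of: {query}>"


def consistent_forward(view: str) -> str:
    # Reads the "image" the same way it would read the text.
    return "4" if "2 + 2" in view else "?"


def inconsistent_forward(view: str) -> str:
    # Simulates a modality-specific error on the visual view.
    return "5"


if __name__ == "__main__":
    print(cycle_reward("4", toy_backward, toy_switch, consistent_forward))
    print(cycle_reward("4", toy_backward, toy_switch, inconsistent_forward))
```

In a reinforcement learning loop, a scalar of this kind would be fed to the policy update as the reward for the sampled trajectory. The choice of optimizer, and whether agreement is scored per step or only at the end of the cycle, are left open here.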