Mode-dependent architectural components (layers that behave differently during training and evaluation, such as Batch Normalization or dropout) are commonly used in visual reinforcement learning but can destabilize on-policy optimization. We show that in Proximal Policy Optimization (PPO), discrepancies between training and evaluation behavior induced by Batch Normalization lead to policy mismatch, distributional drift, and reward collapse. We propose Mode-Dependent Rectification (MDR), a lightweight dual-phase training procedure that stabilizes PPO under mode-dependent layers without architectural changes. Experiments across procedurally generated games and real-world patch-localization tasks demonstrate that MDR consistently improves stability and performance, and extends naturally to other mode-dependent layers.
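To make the train/eval discrepancy concrete, here is a minimal sketch (not the paper's code; the network shape, observation size, and action count are hypothetical) showing how a BatchNorm layer alone shifts a policy's action distribution between rollout (eval mode) and update (train mode), which is the kind of policy mismatch the abstract refers to in PPO.

```python
# Minimal illustrative sketch, assuming a small discrete-action policy with
# BatchNorm; not the paper's architecture or the MDR procedure itself.
import torch
import torch.nn as nn

torch.manual_seed(0)

policy = nn.Sequential(
    nn.Linear(8, 64),
    nn.BatchNorm1d(64),   # mode-dependent layer: batch stats vs. running stats
    nn.ReLU(),
    nn.Linear(64, 4),     # logits over 4 hypothetical discrete actions
)

obs = torch.randn(32, 8)  # a batch of observations

# Rollout-time behavior (eval mode: BatchNorm uses running statistics).
policy.eval()
with torch.no_grad():
    logp_rollout = torch.log_softmax(policy(obs), dim=-1)

# Update-time behavior (train mode: BatchNorm uses per-batch statistics).
policy.train()
logp_update = torch.log_softmax(policy(obs), dim=-1)

# With identical parameters, the PPO importance ratio should be exactly 1;
# the mode switch alone already moves it away from 1.
ratio = (logp_update - logp_rollout).exp()
print("max |ratio - 1| from mode mismatch alone:", (ratio - 1).abs().max().item())
```

Under PPO, this spurious deviation of the importance ratio from 1 is indistinguishable from a genuine policy change, which is one way the mismatch can accumulate into the distributional drift and reward collapse described above.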