Robustness under latent distribution shift remains challenging in partially observable reinforcement learning. We formalize a focused setting in which an adversary selects the hidden initial latent-state distribution before the episode begins, which we term an adversarial latent-initial-state POMDP. Theoretically, we prove a latent minimax principle, characterize worst-case defender distributions, and derive approximate best-response inequalities with finite-sample concentration bounds that make the optimization and sampling error terms explicit. Empirically, on a Battleship benchmark, we show that targeted exposure to shifted latent distributions reduces the average robustness gap between Spread and Uniform initial distributions from 10.3 to 3.1 shots at equal training budget. Iterative best-response training exhibits budget-sensitive behavior that is qualitatively consistent with the theorem-guided diagnostics once discounted PPO surrogates and finite-sample noise are accounted for. For latent-initial-state problems, the framework thus yields a clean evaluation game and useful theorem-motivated diagnostics, while making explicit where implementation-level surrogates and optimization limits enter.
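For concreteness, one plausible formal reading of the latent minimax principle and the approximate best-response bound is sketched below; the notation ($\nu$ for the adversary's initial latent distribution, $\pi$ for the defender's policy, $\epsilon_{\mathrm{opt}}$ and $\epsilon_{\mathrm{stat}}$ for the optimization and sampling terms) is ours and is not fixed by the abstract.

% Sketch, not the paper's exact statement: a latent minimax principle over
% initial latent-state distributions \nu and defender policies \pi, where the
% equality of the two game values is assumed to hold under suitable conditions
% (e.g., mixed defender strategies over a compact policy class).
\[
  V^{\star}
  \;=\; \max_{\pi \in \Pi}\, \min_{\nu \in \Delta(\mathcal{S}_0)}
        \mathbb{E}_{s_0 \sim \nu}\bigl[V^{\pi}(s_0)\bigr]
  \;=\; \min_{\nu \in \Delta(\mathcal{S}_0)}\, \max_{\pi \in \Pi}
        \mathbb{E}_{s_0 \sim \nu}\bigl[V^{\pi}(s_0)\bigr].
\]
% A hypothetical approximate best-response inequality in which the optimization
% and sampling error terms appear explicitly: \hat{\pi} is trained to within
% \epsilon_opt of optimality against n sampled initial states, with failure
% probability \delta and a problem-dependent constant c.
\[
  \min_{\nu \in \Delta(\mathcal{S}_0)}\,
  \mathbb{E}_{s_0 \sim \nu}\bigl[V^{\hat{\pi}}(s_0)\bigr]
  \;\ge\; V^{\star} \;-\; \epsilon_{\mathrm{opt}}
          \;-\; \underbrace{c\sqrt{\tfrac{\log(1/\delta)}{n}}}_{\epsilon_{\mathrm{stat}}}.
\]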