Optimistic Policy Learning under Pessimistic Adversaries with Regret and Violation Guarantees

Real-world decision-making systems operate in environments where state transitions depend not only on the agent's actions but also on \textbf{exogenous factors outside its control}, such as competing agents, environmental disturbances, or strategic adversaries. Formally, $s_{h+1} = f(s_h, a_h, \bar{a}_h) + \omega_h$, where $a_h$ is the agent's action, $\bar{a}_h$ is the adversary (external) action, and $\omega_h$ is additive noise. Ignoring such factors can yield policies that are optimal in isolation but \textbf{fail catastrophically in deployment}, particularly when safety constraints must be satisfied. Standard Constrained MDP (CMDP) formulations assume the agent is the sole driver of state evolution, an assumption that breaks down in safety-critical settings. Existing robust RL approaches address this through distributional robustness over transition kernels, but they do not explicitly model the \textbf{strategic interaction} between the agent and the exogenous factor, and they rely on strong assumptions about divergence from a known nominal model. We model the exogenous factor as an \textbf{adversarial policy} $\bar{\pi}$ that co-determines state transitions, and ask how an agent can remain both optimal and safe against such an adversary. \emph{To the best of our knowledge, this is the first work to study safety-constrained RL under explicit adversarial dynamics}. We propose \textbf{Robust Hallucinated Constrained Upper-Confidence RL} (\texttt{RHC-UCRL}), a model-based algorithm that maintains optimism over both agent and adversary policies while explicitly separating epistemic from aleatoric uncertainty. \texttt{RHC-UCRL} achieves sub-linear regret and constraint-violation guarantees.
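To make the transition model $s_{h+1} = f(s_h, a_h, \bar{a}_h) + \omega_h$ concrete, the following is a minimal simulation sketch and not the \texttt{RHC-UCRL} algorithm itself. Everything specific in it is an assumption made for illustration: the scalar state, the linear form of \texttt{f}, the two hand-written policies, the Gaussian noise, and the safety limit used to count violations. None of these choices come from the abstract.

```python
import random


def f(s, a, a_bar):
    """Hypothetical dynamics: agent and adversary actions both shift a scalar state."""
    return s + a + a_bar


def agent_policy(s):
    # Illustrative agent: steer the state toward zero.
    return -0.5 * s


def adversary_policy(s):
    # Illustrative adversary: push the state away from zero.
    return 0.2 if s >= 0 else -0.2


def rollout(s0, horizon, noise_std=0.1, seed=0):
    """Simulate s_{h+1} = f(s_h, a_h, a_bar_h) + omega_h for `horizon` steps."""
    rng = random.Random(seed)
    states = [s0]
    for _ in range(horizon):
        s = states[-1]
        omega = rng.gauss(0.0, noise_std)  # additive noise (assumed Gaussian)
        states.append(f(s, agent_policy(s), adversary_policy(s)) + omega)
    return states


def count_violations(states, limit=1.0):
    """Count visited states that break the assumed safety constraint |s| <= limit."""
    return sum(abs(s) > limit for s in states)


if __name__ == "__main__":
    trajectory = rollout(s0=2.0, horizon=10)
    print(trajectory)
    print("violations:", count_violations(trajectory))
```

Replacing \texttt{adversary\_policy} with a function that always returns zero recovers the single-agent setting, which gives a simple way to see how a policy tuned in isolation behaves once an opposing action enters the dynamics.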
