Reinforcement Learning (RL) lacks benchmarks that enable precise, white-box diagnostics of agent behavior. Current environments often entangle complexity factors and lack ground-truth optimality metrics, making it difficult to isolate why algorithms fail. We introduce Synthetic Monitoring Environments (SMEs), an infinite suite of continuous control tasks with fully configurable task characteristics and known optimal policies, which permit the exact calculation of instantaneous regret. Their rigorously defined geometric state-space bounds enable systematic within-distribution (WD) and out-of-distribution (OOD) evaluation. We demonstrate the framework's benefit through multidimensional ablations of PPO, TD3, and SAC, revealing how specific environmental properties, such as action- and state-space size, reward sparsity, and the complexity of the optimal policy, affect WD and OOD performance. SMEs thus offer a standardized, transparent testbed for transitioning RL evaluation from empirical benchmarking toward rigorous scientific analysis.
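To make the regret claim concrete, the following is a minimal sketch, not the SME implementation: a toy 1-D reach-the-target task whose optimal policy is known in closed form, showing how a known optimal policy permits exact per-step (instantaneous) regret. All names here (`TARGET`, `MAX_ACTION`, `optimal_action`, the dynamics, and the stand-in random agent) are illustrative assumptions, not the paper's API.

```python
import numpy as np

# Illustrative sketch only: a 1-D task with dense reward (negative
# distance to a target) and linear dynamics, so the greedy policy
# "step toward the target" is globally optimal.

TARGET = 0.0       # hypothetical goal position
MAX_ACTION = 0.5   # hypothetical action-magnitude bound

def reward(state: float) -> float:
    # Dense reward: negative distance to the target.
    return -abs(state - TARGET)

def optimal_action(state: float) -> float:
    # Known optimal policy: move toward the target, clipped to bounds.
    return float(np.clip(TARGET - state, -MAX_ACTION, MAX_ACTION))

def step(state: float, action: float) -> float:
    # Deterministic dynamics: position shifts by the (clipped) action.
    return state + float(np.clip(action, -MAX_ACTION, MAX_ACTION))

rng = np.random.default_rng(0)
state = 2.0
total_regret = 0.0
for t in range(20):
    a_agent = rng.uniform(-MAX_ACTION, MAX_ACTION)  # stand-in agent
    a_opt = optimal_action(state)
    # Instantaneous regret at step t: the exact one-step reward gap
    # between the optimal action and the agent's, from the same state.
    regret_t = reward(step(state, a_opt)) - reward(step(state, a_agent))
    total_regret += regret_t
    state = step(state, a_agent)  # the agent's action drives the env
print(f"cumulative regret after 20 steps: {total_regret:.3f}")
```

Because the optimal policy is available analytically, the regret here is exact rather than estimated; SMEs generalize this idea to configurable continuous control tasks.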