Asymmetric actor-critic methods are widely used in partially observable reinforcement learning, but they typically assume that the full state is available to condition the critic during training, which is often unrealistic in practice. We introduce the informed asymmetric actor-critic framework, which allows the critic to be conditioned on arbitrary state-dependent privileged signals without requiring access to the full state. We show that any such privileged signal yields unbiased policy gradient estimates, substantially expanding the set of admissible privileged information. This raises the problem of selecting the privileged signal that best supports learning. To address it, we propose two novel informativeness criteria: a dependence-based test that can be applied before training, and a criterion based on improvements in value prediction accuracy that can be applied post hoc. Empirical results on partially observable benchmark tasks and synthetic environments demonstrate that carefully selected privileged signals can match or outperform full-state asymmetric baselines while relying on strictly less state information.
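As a brief, non-authoritative sketch of the central claim, using notation introduced here for illustration rather than taken from the paper: let $h_t$ denote the agent's action-observation history, $s_t$ the latent state, and $z_t = f(s_t)$ an arbitrary state-dependent privileged signal available to the critic only during training. The informed asymmetric policy gradient can then be written as
\[
\nabla_\theta J(\theta) \;=\; \mathbb{E}\!\left[\, \sum_{t \ge 0} \gamma^t \, \nabla_\theta \log \pi_\theta(a_t \mid h_t)\, \hat{Q}(h_t, z_t, a_t) \right],
\]
where $\hat{Q}$ is the informed critic. Under the standard consistency assumption $\hat{Q}(h_t, z_t, a_t) = \mathbb{E}[G_t \mid h_t, z_t, a_t]$ for the return $G_t$, the tower property of conditional expectation gives $\mathbb{E}[\hat{Q}(h_t, z_t, a_t) \mid h_t, a_t] = Q^{\pi}(h_t, a_t)$, so the gradient estimator remains unbiased for any choice of state-dependent signal $f$.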