Action advising aims to leverage supplementary guidance from expert teachers to alleviate sample inefficiency in Deep Reinforcement Learning (DRL). Previous agent-specific action advising methods are hindered by imperfections in the agent itself, while agent-agnostic approaches adapt poorly to the learning agent. In this study, we propose a novel framework called Agent-Aware trAining yet Agent-Agnostic Action Advising (A7) to strike a balance between the two. The core idea of A7 is to use the similarity of state features as the indicator for soliciting advice. Unlike prior methods, however, state feature similarity is measured by neither the error-prone learning agent nor the agent-agnostic advisor. Instead, we employ a proxy model to extract state features that are both discriminative (adaptive to the agent) and generally applicable (robust to agent noise). Furthermore, we use behavior cloning to train a model for reusing advice and introduce an intrinsic reward for the advised samples to incentivize the use of expert guidance. Experiments are conducted on GridWorld, LunarLander, and six prominent Atari games. The results demonstrate that A7 significantly accelerates the learning process and surpasses existing methods (both agent-specific and agent-agnostic) by a substantial margin. Our code will be made publicly available.
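As a rough illustration of similarity-gated advice solicitation with an intrinsic reward, the following minimal Python sketch may help. It is not the A7 implementation: the abstract does not specify the similarity measure, the threshold, the budget, or the size of the reward, so cosine similarity, `threshold`, `budget`, `bonus`, and the class name `SimilarityAdvisor` are all assumptions made for illustration. The proxy model is also left out; the sketch takes already-extracted feature vectors as input and omits the behavior-cloning model for advice reuse.

```python
import numpy as np


def cosine_similarity(query, memory):
    """Cosine similarity between one feature vector and each row of memory."""
    query_norm = np.linalg.norm(query) + 1e-8
    memory_norms = np.linalg.norm(memory, axis=1) + 1e-8
    return memory @ query / (memory_norms * query_norm)


class SimilarityAdvisor:
    """Hypothetical helper: ask the teacher only for unfamiliar states."""

    def __init__(self, budget, threshold=0.9, bonus=0.1):
        self.budget = budget        # remaining teacher queries (assumed)
        self.threshold = threshold  # similarity cutoff (assumed)
        self.bonus = bonus          # intrinsic reward size (assumed)
        self.features = []          # features of states already advised

    def should_ask(self, feature):
        """Return True if the state is dissimilar to all advised states."""
        if self.budget <= 0:
            return False
        if not self.features:
            return True
        sims = cosine_similarity(np.asarray(feature, dtype=float),
                                 np.stack(self.features))
        return bool(sims.max() < self.threshold)

    def record_advice(self, feature):
        """Store the features of an advised state and spend one query."""
        self.features.append(np.asarray(feature, dtype=float))
        self.budget -= 1

    def shaped_reward(self, reward, advised):
        """Add the intrinsic bonus to the reward of advised samples only."""
        return reward + self.bonus if advised else reward


advisor = SimilarityAdvisor(budget=2)
seen = np.array([1.0, 0.0])
near = np.array([0.99, 0.01])   # almost the same direction as `seen`
far = np.array([0.0, 1.0])      # orthogonal to `seen`

first_query = advisor.should_ask(seen)   # memory is empty, so ask
advisor.record_advice(seen)
near_query = advisor.should_ask(near)    # similar to an advised state
far_query = advisor.should_ask(far)      # unfamiliar state
```

In this sketch a state triggers a teacher query only when it is dissimilar to every previously advised state and budget remains; a state that is similar would instead be handled by a reuse model, which the abstract says is trained by behavior cloning.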