Despite their promising results, state-of-the-art interactive reinforcement learning schemes rely on passively receiving supervision signals from advisor experts, in the form of either continuous monitoring or pre-defined rules, which inevitably results in a cumbersome and expensive learning process. In this paper, we introduce a novel initiative advisor-in-the-loop actor-critic framework, termed Ask-AC, that replaces the unilateral advisor-guidance mechanism with a bidirectional learner-initiative one, and thereby enables a customized and efficacious message exchange between learner and advisor. At the heart of Ask-AC are two complementary components, namely the action requester and the adaptive state selector, that can be readily incorporated into various discrete actor-critic architectures. The former allows the agent to proactively seek advisor intervention when it encounters uncertain states, while the latter identifies unstable states that the former may miss, especially when the environment changes, and then learns to promote the ask action on such states. Experimental results on both stationary and non-stationary environments and across different actor-critic backbones demonstrate that the proposed framework significantly improves the learning efficiency of the agent and achieves performance on par with that obtained by continuous advisor monitoring.
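The abstract does not spell out how the action requester is implemented. One plausible reading, given here only as a minimal sketch, is that the discrete action space is extended with one extra "ask" action, and that the advisor's action is executed whenever the policy samples it. The names `ASK`, `select_action`, and `advisor` are hypothetical and do not come from the paper; the adaptive state selector is not modelled.

```python
import random

ASK = "ask"  # hypothetical label for the extra action; not a name from the paper


def select_action(policy_probs, env_actions, advisor, state, rng=random):
    """Sample from a policy defined over env_actions plus one ASK action.

    Returns (action_to_execute, asked). If ASK is sampled, the advisor is
    queried for the action to execute. This is an assumed interface.
    """
    choices = list(env_actions) + [ASK]
    if len(policy_probs) != len(choices):
        raise ValueError("policy must cover every environment action plus ASK")
    # Draw one entry according to the policy's probabilities.
    picked = rng.choices(choices, weights=policy_probs, k=1)[0]
    if picked == ASK:
        # Learner-initiated query: the advisor is consulted only on request.
        return advisor(state), True
    return picked, False
```

In this sketch the advisor is consulted only when the learner requests it, which is the contrast the abstract draws with continuous monitoring. How the ask probability is trained is left out, since the abstract gives no details.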