Reinforcement learning (RL) is a powerful technique for learning from trial and error, but it often requires a large number of interactions to achieve good performance. In some domains, such as sparse-reward tasks, an oracle that can provide useful feedback or guidance to the agent during learning can be crucial. However, querying the oracle too frequently may be costly or impractical, and the oracle may not have a clear answer for every situation. We therefore propose a novel retrieval-based method for interacting with the oracle selectively and efficiently. We assume that the interaction can be modeled as a sequence of templated questions and answers, and that a large corpus of previous interactions is available. We use a neural network to encode the current state of the agent and the oracle, and retrieve the most relevant question from the corpus to ask the oracle. We then use the oracle's answer to update the agent's policy and value function. We evaluate our method on an object manipulation task and show that it significantly improves the efficiency of RL, reducing the number of interactions needed to reach a given level of performance compared to baselines that either do not use the oracle or use it naively.
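The retrieval step described above can be illustrated with a minimal sketch. It is not the paper's implementation: the names `CorpusEntry`, `retrieve_question`, and `min_score` are hypothetical, the neural encoder is replaced by precomputed embedding vectors, and both cosine similarity as the relevance measure and the threshold used to skip low-relevance queries are assumptions made only for illustration.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass
class CorpusEntry:
    """One past interaction (hypothetical structure, not from the paper)."""
    question: str           # templated question that was asked
    embedding: List[float]  # embedding of the context in which it was asked


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_question(
    state_embedding: Sequence[float],
    corpus: Sequence[CorpusEntry],
    min_score: float = 0.5,  # assumed threshold for "worth asking"
) -> Optional[Tuple[str, float]]:
    """Return the most similar corpus question and its score.

    Returns None when the corpus is empty or the best match scores below
    min_score, which stands in for choosing not to query the oracle.
    """
    best: Optional[Tuple[str, float]] = None
    for entry in corpus:
        score = cosine(state_embedding, entry.embedding)
        if best is None or score > best[1]:
            best = (entry.question, score)
    if best is None or best[1] < min_score:
        return None
    return best


if __name__ == "__main__":
    corpus = [
        CorpusEntry("Where is the target object?", [1.0, 0.0]),
        CorpusEntry("Which gripper should be used?", [0.0, 1.0]),
    ]
    # In the method described above, this vector would come from the encoder.
    print(retrieve_question([0.9, 0.1], corpus))
```

In this sketch, a `None` result means the agent continues without querying the oracle on that step. How the actual method decides when to ask, and how the answer feeds into the policy and value updates, is not specified in the abstract and is left out here.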