Reinforcement learning agents have mostly been developed and evaluated under the assumption that they will operate fully autonomously, that is, that they will take all actions. In this work, our goal is to develop algorithms that, by learning to switch control between agents, allow existing reinforcement learning agents to operate under different automation levels. To this end, we first formally define the problem of learning to switch control among agents in a team via a two-layer Markov decision process. Then, we develop an online learning algorithm that uses upper confidence bounds on the agents' policies and the environment's transition probabilities to find a sequence of switching policies. The total regret of our algorithm with respect to the optimal switching policy is sublinear in the number of learning steps. Moreover, whenever multiple teams of agents operate in a similar environment, our algorithm greatly benefits from maintaining shared confidence bounds for the environments' transition probabilities and enjoys a better regret bound than problem-agnostic algorithms. Simulation experiments in an obstacle avoidance task illustrate our theoretical findings and demonstrate that, by exploiting the specific structure of the problem, our proposed algorithm outperforms problem-agnostic algorithms.
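To make the idea of shared confidence bounds on transition probabilities concrete, the following is a minimal sketch and not the algorithm from this work. It tracks only the transition side (the confidence bounds on the agents' policies are left out), and the class name `TransitionConfidence`, the parameter `delta`, and the particular L1 radius (a Weissman-style concentration bound) are all assumptions chosen for illustration. It shows that pooling the transitions observed by two teams in the same environment yields a tighter radius than each team obtains alone.

```python
import math
from collections import defaultdict


class TransitionConfidence:
    """Visit counts and an L1 confidence radius for one transition model.

    Hypothetical helper for illustration; the constants in `radius` are an
    assumed Weissman-style bound, not the ones used in the paper.
    """

    def __init__(self, n_states, delta=0.05):
        self.n_states = n_states
        self.delta = delta
        # counts[(state, action)][next_state] -> observed transition count
        self.counts = defaultdict(lambda: [0] * n_states)

    def update(self, state, action, next_state):
        self.counts[(state, action)][next_state] += 1

    def visits(self, state, action):
        return sum(self.counts[(state, action)])

    def estimate(self, state, action):
        """Empirical next-state distribution (uniform if never visited)."""
        n = self.visits(state, action)
        if n == 0:
            return [1.0 / self.n_states] * self.n_states
        return [c / n for c in self.counts[(state, action)]]

    def radius(self, state, action):
        """L1 radius around the empirical estimate; shrinks as 1/sqrt(n)."""
        n = max(1, self.visits(state, action))
        log_term = self.n_states * math.log(2.0) + math.log(1.0 / self.delta)
        return math.sqrt(2.0 * log_term / n)


def demo():
    """Compare per-team confidence sets with one set shared by both teams."""
    separate = [TransitionConfidence(n_states=3) for _ in range(2)]
    shared = TransitionConfidence(n_states=3)
    # Hypothetical logged transitions (state, action, next_state) per team.
    logs = [
        [(0, 0, 1)] * 10,
        [(0, 0, 1)] * 10,
    ]
    for team, log in enumerate(logs):
        for s, a, s_next in log:
            separate[team].update(s, a, s_next)
            shared.update(s, a, s_next)
    return separate[0].radius(0, 0), shared.radius(0, 0), shared


if __name__ == "__main__":
    r_separate, r_shared, _ = demo()
    print(f"per-team radius: {r_separate:.3f}, shared radius: {r_shared:.3f}")
```

Under these assumptions, two teams contributing equally many samples shrink the shared radius by a factor of the square root of two relative to a per-team radius. An optimistic planner would then search over transition models inside the smaller set.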