Multitask Reinforcement Learning (MTRL) approaches have gained increasing attention for their wide applications in many important Reinforcement Learning (RL) tasks. However, while recent advancements in MTRL theory have focused on improving statistical efficiency by assuming a shared structure across tasks, exploration, a crucial aspect of RL, has been largely overlooked. This paper addresses this gap by showing that when an agent is trained on a sufficiently diverse set of tasks, a generic policy-sharing algorithm with a myopic exploration design such as $\epsilon$-greedy, which is inefficient in general, can be sample-efficient for MTRL. To the best of our knowledge, this is the first theoretical demonstration of the "exploration benefits" of MTRL. It may also shed light on the enigmatic empirical success of myopic exploration heuristics, which are widely used in practice. To validate the role of diversity, we conduct experiments on synthetic robotic control environments, where the diverse task set aligns with the tasks selected by automatic curriculum learning, which is empirically shown to improve sample efficiency.
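To make the algorithmic setting concrete, below is a minimal sketch of the kind of method the abstract describes: a single shared Q-table (standing in for a shared policy) trained round-robin on a set of tasks with $\epsilon$-greedy, i.e. myopic, exploration. The `ChainTask` toy environment, the goal set, and all hyperparameters are hypothetical illustrations, not the paper's algorithm or experimental setup.

```python
import random
from collections import defaultdict

# Sketch of policy sharing across tasks with epsilon-greedy (myopic)
# exploration: one Q-table is updated while cycling over a diverse task set.
# ChainTask is a hypothetical toy environment used only for illustration.

class ChainTask:
    """Toy chain MDP: reach state `goal` by moving right (+1) or left (-1)."""
    def __init__(self, goal, length=6, horizon=20):
        self.goal, self.length, self.horizon = goal, length, horizon

    def reset(self):
        self.state, self.t = 0, 0
        return self.state

    def step(self, action):  # action in {-1, +1}
        self.state = min(max(self.state + action, 0), self.length - 1)
        self.t += 1
        reward = 1.0 if self.state == self.goal else 0.0
        done = reward > 0 or self.t >= self.horizon
        return self.state, reward, done

def epsilon_greedy(q, state, actions, eps):
    """Myopic exploration: random action w.p. eps, greedy otherwise."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: q[(state, a)])

def multitask_q_learning(tasks, actions=(-1, 1), episodes=2000,
                         eps=0.1, alpha=0.5, gamma=0.95):
    q = defaultdict(float)  # one Q-table shared by all tasks
    for ep in range(episodes):
        env = tasks[ep % len(tasks)]  # round-robin over the diverse task set
        state, done = env.reset(), False
        while not done:
            a = epsilon_greedy(q, state, actions, eps)
            nxt, r, done = env.step(a)
            # Standard one-step TD update on the shared Q-table.
            best_next = 0.0 if done else max(q[(nxt, b)] for b in actions)
            q[(state, a)] += alpha * (r + gamma * best_next - q[(state, a)])
            state = nxt
    return q

# Goals at different depths play the role of a diverse task set: the easy
# tasks let myopic exploration reach states that the hard tasks alone would
# explore only inefficiently.
shared_q = multitask_q_learning([ChainTask(goal=g) for g in (1, 3, 5)])
```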