Recently, DARPA launched the ShELL program, which aims to explore how experience sharing can help distributed lifelong learning agents adapt to new challenges. We study this question theoretically and empirically in distributed multi-task reinforcement learning (RL), where a group of $N$ agents collaboratively solves $M$ tasks without prior knowledge of the tasks' identities. We formulate the problem as linearly parameterized contextual Markov decision processes (MDPs), in which each task is represented by a context that specifies its transition dynamics and rewards. We propose an algorithm, DistMT-LSVI, in which the agents first identify the tasks and then exchange information through a central server to derive $\epsilon$-optimal policies for them. We show that, to achieve $\epsilon$-optimal policies for all $M$ tasks, a single agent using DistMT-LSVI needs to run at most $\tilde{\mathcal{O}}({d^3H^6(\epsilon^{-2}+c_{\rm sep}^{-2})}\cdot M/N)$ episodes in total, where $c_{\rm sep}>0$ is a constant representing task separability, $H$ is the horizon of each episode, and $d$ is the feature dimension of the dynamics and rewards. Notably, DistMT-LSVI improves on the sample complexity of the non-distributed setting, in which each agent independently learns $\epsilon$-optimal policies for all $M$ tasks using $\tilde{\mathcal{O}}(d^3H^6M\epsilon^{-2})$ episodes, by a factor of $1/N$. Additionally, we provide numerical experiments on OpenAI Gym Atari environments that validate our theoretical findings.
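As a short check of the $1/N$ claim, one can divide the per-agent bound of DistMT-LSVI by the non-distributed bound. This is only a sketch: it ignores the constants and logarithmic factors hidden by $\tilde{\mathcal{O}}$ and uses nothing beyond the two bounds stated above.

$$
\frac{d^3H^6\left(\epsilon^{-2}+c_{\rm sep}^{-2}\right) M/N}{d^3H^6 M\,\epsilon^{-2}}
= \frac{1}{N}\left(1+\frac{\epsilon^{2}}{c_{\rm sep}^{2}}\right).
$$

Under this reading, the improvement is of order $1/N$ whenever $\epsilon \lesssim c_{\rm sep}$ (the ratio is then at most $2/N$). For a target accuracy much coarser than the separability constant, the $c_{\rm sep}^{-2}$ term dominates the distributed bound and the gain over the non-distributed setting shrinks accordingly.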