In this paper, we present the Federated Upper Confidence Bound Value Iteration algorithm ($\texttt{Fed-UCBVI}$), a novel extension of the $\texttt{UCBVI}$ algorithm (Azar et al., 2017) tailored to the federated learning framework. We prove that the regret of $\texttt{Fed-UCBVI}$ scales as $\tilde{\mathcal{O}}(\sqrt{H^3 |\mathcal{S}| |\mathcal{A}| T / M})$, plus a small additional term due to heterogeneity, where $|\mathcal{S}|$ is the number of states, $|\mathcal{A}|$ is the number of actions, $H$ is the episode length, $M$ is the number of agents, and $T$ is the number of episodes. Notably, in the single-agent setting ($M = 1$), this upper bound matches the minimax lower bound up to polylogarithmic factors, while in the multi-agent setting, $\texttt{Fed-UCBVI}$ achieves a linear speed-up. To conduct our analysis, we introduce a new measure of heterogeneity, which may be of independent theoretical interest. Furthermore, we show that, unlike existing federated reinforcement learning approaches, the communication complexity of $\texttt{Fed-UCBVI}$ increases only marginally with the number of agents.
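
To make the linear speed-up claim concrete, the following minimal sketch evaluates only the leading term $\sqrt{H^3 |\mathcal{S}| |\mathcal{A}| T / M}$ of the stated bound. It is an illustration under simplifying assumptions, not part of $\texttt{Fed-UCBVI}$ itself: constants, polylogarithmic factors, and the heterogeneity term are all dropped, and the function names `leading_regret_term` and `episodes_for_avg_regret` are hypothetical helpers introduced here. The second helper assumes the target is a per-episode average of the leading term, which is one possible reading of the bound.

```python
import math


def leading_regret_term(H, S, A, T, M=1):
    """Leading term sqrt(H^3 * S * A * T / M) of the stated bound.

    Assumption: constants, polylog factors, and the heterogeneity
    term are ignored, so only ratios between settings are meaningful.
    """
    return math.sqrt(H**3 * S * A * T / M)


def episodes_for_avg_regret(H, S, A, M, eps):
    """Smallest T with leading_regret_term(...) / T <= eps.

    Solving sqrt(H^3 S A T / M) / T <= eps for T gives
    T >= H^3 S A / (M * eps^2), which shrinks linearly in M.
    """
    return math.ceil(H**3 * S * A / (M * eps**2))


# Hypothetical problem sizes, chosen only for illustration.
H, S, A, T = 10, 20, 5, 10_000

single_agent = leading_regret_term(H, S, A, T, M=1)
sixteen_agents = leading_regret_term(H, S, A, T, M=16)

# At a fixed number of episodes the leading term shrinks by sqrt(M).
ratio = single_agent / sixteen_agents

# At a fixed average-regret target the required episodes shrink by M.
t_single = episodes_for_avg_regret(H, S, A, M=1, eps=1.0)
t_fed = episodes_for_avg_regret(H, S, A, M=16, eps=1.0)
```

Under these assumptions, moving from 1 to 16 agents divides the leading term by $\sqrt{16} = 4$ at a fixed $T$, and divides the number of episodes needed to reach a fixed per-episode average by 16, which is the sense in which the speed-up is linear in $M$.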