Motivated by real-world settings where data collection and policy deployment -- whether for a single agent or across multiple agents -- are costly, we study the problem of on-policy single-agent reinforcement learning (RL) and federated RL (FRL) with a focus on minimizing burn-in costs (the sample sizes needed to reach near-optimal regret) and policy switching or communication costs. In parallel finite-horizon episodic Markov Decision Processes (MDPs) with $S$ states and $A$ actions, existing methods either require superlinear burn-in costs in $S$ and $A$ or fail to achieve logarithmic switching or communication costs. We propose two novel model-free RL algorithms -- Q-EarlySettled-LowCost and FedQ-EarlySettled-LowCost -- that are the first in the literature to simultaneously achieve: (i) the best near-optimal regret among all known model-free RL or FRL algorithms, (ii) low burn-in cost that scales linearly with $S$ and $A$, and (iii) logarithmic policy switching cost for single-agent RL or communication cost for FRL. Additionally, we establish gap-dependent theoretical guarantees for both regret and switching/communication costs, improving or matching the best-known gap-dependent bounds.
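To make the notions of policy switching cost and logarithmic switching concrete, below is a minimal sketch (not the paper's Q-EarlySettled-LowCost or FedQ-EarlySettled-LowCost algorithm) of optimistic model-free Q-learning in a finite-horizon episodic MDP, where the deployed policy is only switched when some visit count doubles. The synthetic transition kernel `P`, reward table `R`, bonus constant `c_bonus`, and the doubling trigger are all illustrative assumptions used solely to show how a doubling rule keeps the number of policy switches logarithmic in the number of episodes.

```python
# Minimal sketch (not the paper's algorithm): optimistic model-free Q-learning
# for a finite-horizon episodic MDP, with a doubling trigger that switches the
# deployed (behavior) policy only when some count N[h, s, a] has doubled since
# the last switch. Doubling rules of this kind are what yield logarithmic
# policy switching costs in the low-switching-cost literature; the environment
# below is a random synthetic MDP used purely for illustration.
import numpy as np

S, A, H, K = 5, 3, 4, 2000                      # states, actions, horizon, episodes
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(S), size=(H, S, A))   # transition kernels P_h(. | s, a)
R = rng.uniform(size=(H, S, A))                 # deterministic mean rewards

Q = np.full((H + 1, S, A), float(H))            # optimistic Q-values, Q_{H+1} = 0
Q[H] = 0.0
N = np.zeros((H, S, A))                         # visit counts
N_last_switch = np.zeros((H, S, A))             # visit counts at the last switch
policy = Q[:H].argmax(axis=2)                   # deployed greedy policy
switches, c_bonus = 0, 1.0                      # c_bonus: illustrative constant

for k in range(K):
    s = rng.integers(S)                         # random initial state
    for h in range(H):
        a = policy[h, s]                        # act with the *deployed* policy
        s_next = rng.choice(S, p=P[h, s, a])
        N[h, s, a] += 1
        t = N[h, s, a]
        alpha = (H + 1) / (H + t)               # standard Q-learning step size
        bonus = c_bonus * np.sqrt(H**3 * np.log(K) / t)
        target = R[h, s, a] + Q[h + 1, s_next].max() + bonus
        Q[h, s, a] = min((1 - alpha) * Q[h, s, a] + alpha * target, H)
        s = s_next
    # Switch the deployed policy only when some visit count has doubled,
    # so the total number of switches grows logarithmically in K.
    if np.any(N >= 2 * np.maximum(N_last_switch, 1)):
        policy = Q[:H].argmax(axis=2)
        N_last_switch = N.copy()
        switches += 1

print(f"episodes: {K}, policy switches: {switches}")
```

In a federated variant, the same doubling trigger would govern when agents communicate updated estimates to a central server, which is why the switching and communication costs are treated in parallel above.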