In this work, we consider the problem of collaborative multi-user reinforcement learning. In this setting, multiple users share the same state-action space and transition probabilities but have different rewards. Under the assumption that the reward matrix of the $N$ users has a low-rank structure (a standard and practically successful assumption in the offline collaborative filtering setting), the question is whether we can design algorithms with significantly lower sample complexity than those that learn the MDP individually for each user. Our main contribution is an algorithm that explores rewards collaboratively across the $N$ user-specific MDPs and learns rewards efficiently in two key settings: tabular MDPs and linear MDPs. When $N$ is large and the rank is constant, the sample complexity per MDP depends logarithmically on the size of the state space, which represents an exponential reduction (in the state-space size) compared to the standard ``non-collaborative'' algorithms.
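To make the low-rank assumption concrete, the following minimal Python sketch builds a reward matrix with one row per user and one column per state-action pair as a product of two thin factors, then compares the number of entries that would be learned separately per user with the number of parameters in the factorization. It is not the algorithm proposed in this work; the function names, matrix sizes, and integer-valued factors are hypothetical choices made for illustration, and the parameter count is only a heuristic intuition, not the sample complexity bound.

```python
import random
from fractions import Fraction


def low_rank_rewards(num_users, num_state_actions, rank, seed=0):
    """Build a num_users x num_state_actions matrix R = U V^T of rank <= `rank`."""
    rng = random.Random(seed)
    # Small integer factors keep the arithmetic exact (illustrative values only).
    U = [[rng.randint(-3, 3) for _ in range(rank)] for _ in range(num_users)]
    V = [[rng.randint(-3, 3) for _ in range(rank)] for _ in range(num_state_actions)]
    return [[sum(u[k] * v[k] for k in range(rank)) for v in V] for u in U]


def matrix_rank(matrix):
    """Exact rank via Gaussian elimination over the rationals."""
    rows = [[Fraction(x) for x in row] for row in matrix]
    rank = 0
    for col in range(len(rows[0])):
        # Find a row at or below the current pivot position with a nonzero entry.
        pivot = next((i for i in range(rank, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        # Eliminate this column from every row below the pivot row.
        for i in range(rank + 1, len(rows)):
            factor = rows[i][col] / rows[rank][col]
            rows[i] = [a - factor * b for a, b in zip(rows[i], rows[rank])]
        rank += 1
        if rank == len(rows):
            break
    return rank


def free_parameters(num_users, num_state_actions, rank):
    """Entries learned separately per user vs. parameters of a rank-`rank` factorization."""
    separate = num_users * num_state_actions
    shared = rank * (num_users + num_state_actions)
    return separate, shared


# Hypothetical sizes: 50 users, 40 state-action pairs, rank 2.
N, SA, r = 50, 40, 2
R = low_rank_rewards(N, SA, r)
separate, shared = free_parameters(N, SA, r)
print(matrix_rank(R) <= r, separate, shared)  # prints: True 2000 180
```

With these sizes, the factorization has 180 parameters against 2000 individual reward entries, which gives a rough sense of why sharing information across users can reduce the amount of exploration each user needs.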