We study a system with finitely many groups of multi-action bandit processes, each of which is a Markov decision process (MDP) with finite state and action spaces and potentially different transition matrices under different actions. Bandit processes in the same group share the same state and action spaces and, under the same action, the same transition matrix. All bandit processes across the groups are subject to multiple weakly coupled constraints on their state and action variables. Unlike past studies, which focused on the offline case, we consider the online case without assuming a priori knowledge of the transition matrices and reward functions, and we propose an effective scheme that enables simultaneous learning and control. We prove the convergence of the relevant processes both over time and in the number of bandit processes, referred to as convergence in the time and magnitude dimensions, respectively. Moreover, we prove that the relevant processes converge exponentially fast in the magnitude dimension, leading to an exponentially diminishing performance gap between the proposed online algorithms and offline optimality.
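The model setup above can be illustrated with a minimal sketch. All sizes, the random reward/transition tables, and the greedy priority rule below are hypothetical stand-ins (the paper's actual policy also learns the transition matrices and rewards online); the sketch only shows the structure: processes in the same group share per-action transition matrices, and a per-step activation budget plays the role of a weakly coupled constraint.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 2 groups, 3 states, 2 actions (0 = passive, 1 = active).
G, S, A = 2, 3, 2
N_per_group = 5   # bandit processes per group
budget = 3        # weakly coupled constraint: at most 3 processes active per step

# One transition matrix per (group, action): processes in the same group share
# these. P[g, a, s] is a probability distribution over the next state.
P = rng.dirichlet(np.ones(S), size=(G, A, S))
R = rng.random((G, A, S))   # reward for action a in state s, group g

states = rng.integers(0, S, size=(G, N_per_group))

def step(states, budget):
    """One decision epoch: greedily activate the processes with the largest
    immediate reward gain, subject to the budget (a stand-in for an
    index/priority policy)."""
    gain = R[np.arange(G)[:, None], 1, states] - R[np.arange(G)[:, None], 0, states]
    top = np.argsort(gain, axis=None)[::-1][:budget]   # top-`budget` processes
    actions = np.zeros((G, N_per_group), dtype=int)
    for idx in top:
        g, n = np.unravel_index(idx, gain.shape)
        actions[g, n] = 1
    reward = 0.0
    new_states = states.copy()
    for g in range(G):
        for n in range(N_per_group):
            a, s = actions[g, n], states[g, n]
            reward += R[g, a, s]
            new_states[g, n] = rng.choice(S, p=P[g, a, s])
    return new_states, actions, reward

states, actions, reward = step(states, budget)
```

Scaling `N_per_group` is the "magnitude dimension" in which the paper's convergence and exponential performance-gap results are stated.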