Cluster-Based Control of Transition-Independent MDPs

This work studies efficient solution methods for cluster-based control policies of transition-independent Markov decision processes (TI-MDPs). We focus on control of multi-agent systems in which a central planner (CP) influences agents to select desirable group behavior. The agents are partitioned into disjoint clusters such that agents in the same cluster receive the same controls, while agents in different clusters may receive different controls. Under mild assumptions, this process can be modeled as a TI-MDP in which each factor describes the behavior of one cluster. The action space of the TI-MDP grows exponentially with the number of clusters. To efficiently find a policy in this rapidly growing space, we propose a clustered Bellman operator that optimizes over the action space of only one cluster at each evaluation. We present Clustered Value Iteration (CVI), which uses this operator to perform "round robin" optimization across the clusters. CVI converges exponentially faster than standard value iteration (VI) and can find policies that closely approximate the MDP's true optimal value. A special class of TI-MDPs with separable reward functions is investigated, and CVI is shown to find optimal policies on this class of problems. We then explore the optimal clustering assignment problem. The value functions of TI-MDPs with submodular reward functions are shown to be submodular, so submodular set optimization may be used to find a near-optimal clustering assignment. We propose an iterative greedy cluster splitting algorithm, which yields monotonic submodular improvement in value at each iteration. Finally, simulations offer an empirical assessment of the proposed methods.
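
The round-robin idea behind the clustered Bellman operator can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `P`, `reward`, `gamma`, `tol`, and `max_cycles`, the all-zeros initial policy, and the per-cycle stopping rule are all assumptions made for illustration. Each cluster `c` is assumed to have a transition array `P[c]` of shape `(A_c, S_c, S_c)`, and the joint transition row is taken to be the Kronecker product of the per-cluster rows, which is one way to encode transition independence. A plain value iteration over the joint action space is included only as a baseline.

```python
import itertools
from functools import reduce

import numpy as np


def joint_states(P):
    """Enumerate joint states (s_1, ..., s_C); the last cluster varies fastest."""
    return list(itertools.product(*[range(Pc.shape[1]) for Pc in P]))


def joint_row(P, s, a):
    """Next-state distribution for joint state s under joint action a.

    Assumed transition independence: the joint row is the Kronecker
    product of the per-cluster rows P[c][a[c], s[c], :].
    """
    return reduce(np.kron, [Pc[a_c, s_c] for Pc, s_c, a_c in zip(P, s, a)])


def q_value(P, reward, gamma, V, s, a):
    """One-step lookahead value of taking joint action a in joint state s."""
    return reward(s, a) + gamma * joint_row(P, s, a) @ V


def clustered_value_iteration(P, reward, gamma=0.9, tol=1e-10, max_cycles=5000):
    """Round-robin sketch: each step optimizes one cluster's action only,
    holding the other clusters' actions at the current policy."""
    states = joint_states(P)
    V = np.zeros(len(states))
    # Hypothetical initial policy: every cluster plays action 0 in every state.
    policy = [[0] * len(P) for _ in states]
    for _ in range(max_cycles):
        V_start = V.copy()
        for c, Pc in enumerate(P):  # one pass over the clusters = one cycle
            V_new = np.empty_like(V)
            for i, s in enumerate(states):
                candidates = []
                for a_c in range(Pc.shape[0]):  # search cluster c's actions only
                    a = policy[i].copy()
                    a[c] = a_c
                    candidates.append(q_value(P, reward, gamma, V, s, a))
                best = int(np.argmax(candidates))
                policy[i][c] = best
                V_new[i] = candidates[best]
            V = V_new
        # Assumed stopping rule: small change over a full cycle of clusters.
        if np.max(np.abs(V - V_start)) < tol:
            break
    return V, policy


def value_iteration(P, reward, gamma=0.9, tol=1e-10, max_iters=5000):
    """Baseline: standard value iteration maximizing over all joint actions."""
    states = joint_states(P)
    actions = list(itertools.product(*[range(Pc.shape[0]) for Pc in P]))
    V = np.zeros(len(states))
    for _ in range(max_iters):
        V_new = np.array([
            max(q_value(P, reward, gamma, V, s, a) for a in actions)
            for s in states
        ])
        converged = np.max(np.abs(V_new - V)) < tol
        V = V_new
        if converged:
            break
    return V
```

In this sketch each clustered update evaluates only the actions of one cluster per state, whereas the baseline evaluates every joint action, whose count is the product of the per-cluster action counts. With a reward that is a sum of per-cluster terms, the two routines would be expected to return matching values, in line with the separable-reward result stated above; for general rewards the clustered routine should be read as an approximation.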
