Most real-world Multi-Robot Task Allocation (MRTA) problems require fast and efficient decision-making, which is often achieved using heuristics-aided methods such as genetic algorithms, auction-based methods, and bipartite graph matching methods. These methods often take a form that is more explainable than an end-to-end (learned) neural-network-based policy for MRTA. However, deriving suitable heuristics can be tedious, risky, and in some cases impractical if problems are too complex. This raises the question: can these heuristics be learned? To this end, this paper develops a Graph Reinforcement Learning (GRL) framework to learn the heuristics, or incentives, for a bipartite graph matching approach to MRTA. Specifically, a Capsule Attention policy model is used to learn how to weight task/robot pairings (edges) in the bipartite graph that connects the set of tasks to the set of robots. The original capsule attention network architecture is fundamentally modified by adding an encoding of the robots' state graph and two Multihead Attention based decoders, whose outputs are used to construct a LogNormal distribution matrix from which positive bigraph weights can be drawn. The performance of this new bigraph matching approach augmented with a GRL-derived incentive is found to be on par with the original bigraph matching approach that used expert-specified heuristics, with the former offering notable robustness benefits. During training, the learned incentive policy is found to initially move closer to the expert-specified incentive and then deviate slightly from its trend.
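To make the weighting-then-matching idea concrete, the following is a minimal sketch, not the paper's implementation. It assumes the two decoder outputs can be read as a per-edge location matrix `mu` and scale matrix `sigma` of a LogNormal distribution (the abstract does not state which decoder produces which parameter), draws one positive weight per robot/task edge, and then picks the maximum-weight matching. The function names, the placeholder parameter values, and the brute-force matcher are all illustrative assumptions; a practical system would use a polynomial-time assignment solver.

```python
import itertools
import random


def sample_weights(mu, sigma, rng):
    """Draw one positive weight per (robot, task) edge.

    mu and sigma are hypothetical stand-ins for the policy's decoder
    outputs: per-edge LogNormal location and scale parameters.
    """
    return [
        [rng.lognormvariate(m, s) for m, s in zip(mu_row, sigma_row)]
        for mu_row, sigma_row in zip(mu, sigma)
    ]


def max_weight_matching(weights):
    """Brute-force maximum-weight bipartite matching.

    Rows are robots, columns are tasks; every robot receives exactly
    one distinct task. Assumes n_robots <= n_tasks and a small problem,
    since the search enumerates all assignments.
    """
    n_robots = len(weights)
    n_tasks = len(weights[0])
    best_total, best_tasks = float("-inf"), None
    for tasks in itertools.permutations(range(n_tasks), n_robots):
        total = sum(weights[r][t] for r, t in enumerate(tasks))
        if total > best_total:
            best_total, best_tasks = total, tasks
    # Return (robot, task) pairs together with the total matched weight.
    return list(enumerate(best_tasks)), best_total


# Illustrative usage with made-up parameters for 2 robots and 3 tasks.
rng = random.Random(0)
mu = [[0.0, 1.0, -1.0], [1.0, 0.0, -1.0]]
sigma = [[0.1, 0.1, 0.1], [0.1, 0.1, 0.1]]
weights = sample_weights(mu, sigma, rng)
matching, total = max_weight_matching(weights)
```

Because LogNormal samples are strictly positive, every edge keeps a valid positive incentive regardless of the sign of `mu`, which is presumably why that distribution suits bigraph weights.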