Learning Ordinal Response Policies in Rank-Based Stochastic Prize-Collecting Games

The Team Orienteering Problem (TOP) generalizes many real-world multi-agent scheduling and routing tasks that arise in autonomous mobility, aerial logistics, and surveillance applications. While many variants of the TOP exist for planning in multi-agent systems, they assume that all agents cooperate toward a single objective; they therefore do not extend to settings where agents compete in reward-scarce environments. We propose Stochastic Prize-Collecting Orienteering Games (SPCOG) as an extension of the TOP for planning in the presence of self-interested agents operating on a graph under energy constraints and stochastic transitions. A theoretical discussion on complete and star graphs establishes that SPCOGs have a unique pure Nash equilibrium that coincides with the optimal routing solution of an equivalent TOP under rank-based conflict resolution. We propose the concept of Ordinal Rank (OR) as a concise representation of an agent's global rank and its location within a topological, well-defined neighborhood. Empirical evaluations on real-world road-network graphs under both dynamic and stationary prize distributions show that, in parameter-sharing settings, policies that leverage local information can outperform policies that leverage global information when the former are conditioned on the OR rather than the global rank, indicating that the OR acts as a strong inductive bias in multi-agent games on graphs. The OR-conditioned policies also generalize much better to games with a large number of agents than global-rank-conditioned policies. Finally, we propose Fictitious Ordinal Response Learning (FORL) as an entropy-regulated algorithm to obtain convergent policies in independent-learning settings in prize-collecting games on graphs.
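To make the two rank-based notions above concrete, the following is a minimal sketch, not the authors' formulation. It assumes that a lower rank number means higher priority, that a contested node's prize goes entirely to the best-ranked agent present, and that an agent's ordinal rank is its position among the agents currently inside a given neighborhood. The function names `resolve_conflicts` and `ordinal_rank`, and the representation of a neighborhood as a set of nodes, are hypothetical choices made only for illustration.

```python
from collections import defaultdict


def resolve_conflicts(positions, ranks, prizes):
    """Assign each node's prize to the best-ranked agent standing on it.

    positions: dict agent -> node
    ranks:     dict agent -> int (ASSUMPTION: lower value = higher priority)
    prizes:    dict node -> float
    Returns a dict agent -> reward collected in this step.
    """
    contenders = defaultdict(list)
    for agent, node in positions.items():
        contenders[node].append(agent)

    rewards = {agent: 0.0 for agent in positions}
    for node, agents in contenders.items():
        # ASSUMPTION: winner takes the whole prize; the others get nothing.
        winner = min(agents, key=lambda a: ranks[a])
        rewards[winner] = prizes.get(node, 0.0)
    return rewards


def ordinal_rank(agent, positions, ranks, neighborhood):
    """Position of `agent` among the agents located inside `neighborhood`.

    neighborhood: set of nodes (ASSUMPTION: e.g. nodes within a few hops).
    Returns 1 if no better-ranked agent is inside the neighborhood.
    """
    better = sum(
        1
        for other, node in positions.items()
        if other != agent and node in neighborhood and ranks[other] < ranks[agent]
    )
    return 1 + better


positions = {"a": 0, "b": 0, "c": 1}
ranks = {"a": 2, "b": 1, "c": 3}
prizes = {0: 5.0, 1: 2.0}

rewards = resolve_conflicts(positions, ranks, prizes)
local_rank_c = ordinal_rank("c", positions, ranks, neighborhood={1})
wide_rank_c = ordinal_rank("c", positions, ranks, neighborhood={0, 1})
```

In this toy instance, agent `c` is last in the global ranking but first within the neighborhood that contains only its own node, which is the kind of local information a rank restricted to a neighborhood can expose and a global rank cannot.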

