Mixed-motive scenarios are ubiquitous in real-world multi-agent interactions, where self-interested agents often defect for immediate rewards, overlooking the potential of altruistic cooperation to improve long-term gains and collective welfare. Peer punishment can deter defection, but because it is a costly form of second-order altruism, imposing it persistently may undermine the punisher's own interests. Existing approaches often struggle to implement punishment in a way that effectively promotes cooperation. To balance the efficacy and cost of punishment, we propose Adaptive Punishment for Cooperation (APC), a distributed method that determines punishment intensity based on both a dynamic punishment probability and the severity of defection. This dynamic probability substantially reduces costly and ineffective punishment while also promoting cooperation. To accurately assess defection and its severity, we use a defection awareness module whose learning is guided by the game reward. Theoretical analysis and empirical results show that APC performs effectively in the iterated public goods game. Empirically, APC also significantly outperforms existing baselines across sequential social dilemmas, learning rational and effective punishment policies that foster cooperation by strategically deterring defection.