We study a robust multi-agent multi-armed bandit problem, that is, one in which some participants may be malicious, where the participants are distributed on a fully decentralized blockchain. The arm rewards are homogeneous across the honest participants and follow time-invariant stochastic distributions; they are revealed to the participants only when certain conditions are met, which ensures that the coordination mechanism is sufficiently secure. The objective of the coordination mechanism is to efficiently maximize the cumulative reward gained by the honest participants. To this end, we are the first to incorporate advanced techniques from blockchains, together with novel mechanisms, into such a cooperative decision-making framework in order to design optimal strategies for honest participants. The framework accommodates a variety of malicious behaviors while maintaining security and participant privacy. More specifically, we select a pool of validators who communicate with all participants, design a new consensus mechanism based on digital signatures for these validators, devise a UCB-based strategy that requires less information from participants through secure multi-party computation, and design the chain-participant interaction along with an incentive mechanism that encourages participation. Notably, we are the first to prove a theoretical regret bound for the proposed algorithm and to claim its optimality. Unlike existing work that integrates blockchains with learning problems such as federated learning, which mainly assesses optimality through computational experiments, we show that the regret of honest participants is upper bounded by $O(\log T)$ under certain assumptions. This bound is consistent with the regret known for the multi-agent multi-armed bandit problem, both without malicious participants and with purely Byzantine attacks that do not affect the entire system.
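For readers unfamiliar with the UCB family that the proposed strategy builds on, the following is a minimal sketch of the textbook single-agent UCB1 rule on Bernoulli arms. It is not the algorithm of this work: it has no blockchain, validators, secure multi-party computation, or malicious participants. The function name `ucb1`, the arm means, and the horizon are illustrative assumptions. The sketch only shows the kind of index policy whose cumulative regret grows logarithmically in $T$.

```python
import math
import random


def ucb1(means, horizon, seed=0):
    """Textbook single-agent UCB1 on Bernoulli arms (illustrative only).

    means   -- hypothetical true success probabilities, one per arm
    horizon -- number of rounds T
    Returns (pull counts per arm, cumulative pseudo-regret).
    """
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k      # number of pulls of each arm
    sums = [0.0] * k      # total observed reward of each arm
    best = max(means)
    regret = 0.0

    for t in range(1, horizon + 1):
        if t <= k:
            # Initialization: pull every arm once.
            arm = t - 1
        else:
            # Pick the arm with the largest upper confidence bound:
            # empirical mean plus an exploration bonus.
            arm = max(
                range(k),
                key=lambda i: sums[i] / counts[i]
                + math.sqrt(2.0 * math.log(t) / counts[i]),
            )
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        # Pseudo-regret: gap between the best mean and the chosen arm's mean.
        regret += best - means[arm]

    return counts, regret


if __name__ == "__main__":
    counts, regret = ucb1([0.2, 0.5, 0.8], horizon=5000)
    print("pulls per arm:", counts)
    print("cumulative pseudo-regret:", round(regret, 1))
```

In a multi-agent setting, the empirical means and counts would be aggregated across participants rather than kept by a single learner. How that aggregation is secured against malicious participants is the subject of the work itself and is not modeled here.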