We study a robust multi-agent multi-armed bandit problem in which multiple clients, or participants, are distributed on a fully decentralized blockchain and some of them may be malicious. The arm rewards are homogeneous across clients and follow time-invariant stochastic distributions, which are revealed to the participants only when the system is secure enough. The system's objective is to efficiently guarantee the cumulative rewards gained by the honest participants. To this end, and to the best of our knowledge, we are the first to incorporate advanced techniques from blockchains, together with novel mechanisms, into the system to design optimal strategies for honest participants. This design accommodates various malicious behaviors while preserving participant privacy. More specifically, we randomly select a pool of validators who have access to all participants, design a new consensus mechanism based on digital signatures for these validators, devise a UCB-based strategy that requires less information from participants through secure multi-party computation, and design the chain-participant interaction along with an incentive mechanism to encourage participation. Notably, we are the first to establish theoretical guarantees for the proposed algorithms through regret analysis in the context of optimality in blockchains. Unlike existing work that integrates blockchains with learning problems such as federated learning, which mainly focuses on numerical optimality, we show that the regret of honest participants is upper bounded by a term of order $\log T$. This is consistent with the multi-agent multi-armed bandit problem without malicious participants and with the robust multi-agent multi-armed bandit problem under purely Byzantine attacks.
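For readers unfamiliar with the UCB family that the abstract refers to, the following is a minimal sketch of the classical single-agent UCB1 rule on Bernoulli arms. It is not the algorithm proposed in this work: it omits the validators, the consensus mechanism, secure multi-party computation, and any malicious participants. The function name `ucb1`, the parameters `means`, `horizon`, and `seed`, and the Bernoulli reward model are all assumptions made purely for illustration.

```python
import math
import random


def ucb1(means, horizon, seed=0):
    """Run classical single-agent UCB1 on Bernoulli arms.

    `means` holds the (hypothetical) true arm means, used only to
    simulate rewards and to measure pseudo-regret.
    Returns (cumulative pseudo-regret, pull counts per arm).
    """
    rng = random.Random(seed)
    n_arms = len(means)
    counts = [0] * n_arms   # number of pulls per arm
    sums = [0.0] * n_arms   # total observed reward per arm
    best = max(means)
    regret = 0.0

    for t in range(1, horizon + 1):
        if t <= n_arms:
            # Initialization: pull every arm once.
            arm = t - 1
        else:
            # Pick the arm with the largest upper confidence bound:
            # empirical mean plus an exploration bonus.
            arm = max(
                range(n_arms),
                key=lambda a: sums[a] / counts[a]
                + math.sqrt(2.0 * math.log(t) / counts[a]),
            )
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        regret += best - means[arm]

    return regret, counts


if __name__ == "__main__":
    total_regret, pulls = ucb1([0.2, 0.5, 0.8], horizon=5000)
    print("pseudo-regret:", total_regret)
    print("pulls per arm:", pulls)
```

In this simplified setting the pseudo-regret is known to grow logarithmically in the horizon, which is the kind of rate the abstract claims for honest participants in the far more adversarial blockchain setting.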