We study cooperative stochastic multi-armed bandits with vector-valued rewards under adversarial corruption and limited verification. In each of $T$ rounds, each of $N$ agents selects an arm, the environment generates a clean reward vector, and an adversary perturbs the observed feedback subject to a global corruption budget $\Gamma$. Performance is measured by team regret under a coordinate-wise nondecreasing, $L$-Lipschitz scalarization $\varphi$, covering linear, Chebyshev, and smooth monotone utilities. Our main contribution is a communication-corruption coupling: we show that a fixed environment-side budget $\Gamma$ can translate into an effective corruption level ranging from $\Gamma$ to $N\Gamma$, depending on whether agents share raw samples, sufficient statistics, or only arm recommendations. We formalize this via a protocol-induced multiplicity functional and prove regret bounds parameterized by the resulting effective corruption. As corollaries, raw-sample sharing can incur an $N$-fold larger additive corruption penalty, whereas summary sharing and recommendation-only sharing preserve an unamplified $O(\Gamma)$ term and achieve centralized-rate team regret. We further establish information-theoretic limits, including an unavoidable additive $\Omega(\Gamma)$ penalty and a high-corruption regime $\Gamma=\Theta(NT)$ in which sublinear regret is impossible without clean information. Finally, we characterize how a global budget $\nu$ of verified observations restores learnability: verification is necessary in the high-corruption regime, sufficient once the budget crosses the identification threshold, and, with certified sharing, makes the team's regret independent of $\Gamma$.
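To fix notation, the objective can be sketched as follows; this is a plausible formalization under assumed symbols ($\mu$, $a_{i,t}$, $c_{i,t}$ are illustrative), and the paper's exact definitions may differ. Let $\mu(a)\in\mathbb{R}^d$ denote the mean reward vector of arm $a$, $a_{i,t}$ the arm pulled by agent $i$ in round $t$, and $c_{i,t}$ the adversary's perturbation of the corresponding observation. Then

```latex
% Optimal arm under the scalarization and scalarized team regret:
\[
  a^\star \in \arg\max_{a} \varphi\bigl(\mu(a)\bigr),
  \qquad
  R_T \;=\; \sum_{t=1}^{T} \sum_{i=1}^{N}
    \Bigl[ \varphi\bigl(\mu(a^\star)\bigr) - \varphi\bigl(\mu(a_{i,t})\bigr) \Bigr].
\]
% One natural form of the global (environment-side) corruption budget:
\[
  \sum_{t=1}^{T} \max_{i \in [N]} \bigl\lVert c_{i,t} \bigr\rVert_\infty \;\le\; \Gamma .
\]
```

Under a budget of this form, a single perturbed sample that is broadcast to all $N$ agents consumes the budget once yet enters $N$ local estimators, which is the intuition behind the $\Gamma$-to-$N\Gamma$ amplification under raw-sample sharing.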