Multi-Armed Bandit (MAB) systems are increasingly deployed in multi-agent distributed environments, driving the development of collaborative MAB algorithms. In such settings, communication between the agents executing actions and the learner making decisions can hinder the learning process. A prevalent challenge in distributed learning is action erasure, often induced by communication delays and/or channel noise. As a result, agents may not receive the learner's intended action, which in turn leads to misguided feedback. In this paper, we introduce novel algorithms that enable learners to interact concurrently with distributed agents across heterogeneous action erasure channels, each with a different erasure probability. We show that, in contrast to existing bandit algorithms, which suffer linear regret, our algorithms guarantee sub-linear regret. Our solutions are built on a carefully crafted repetition protocol and on scheduling learning across the heterogeneous channels. To our knowledge, these are the first algorithms capable of learning effectively through heterogeneous action erasure channels. We substantiate the superior performance of our algorithms through numerical experiments, highlighting their practical significance in addressing communication constraints and delays in multi-agent environments.
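To make the erasure problem concrete, the following is a minimal toy simulation of a learner playing a two-armed bandit through a single action erasure channel. The repetition scheme here (re-send the chosen action a fixed number of times, while the agent keeps executing the last action it actually received) is a hypothetical illustration of the general idea, not the paper's exact protocol; the arm means, horizon, and UCB index are likewise illustrative assumptions.

```python
import math
import random

def simulate(horizon=2000, erasure_p=0.5, repetitions=4, seed=0):
    """Toy 2-armed bandit over an erasure channel.

    The learner picks an arm by a standard UCB index and re-sends it
    `repetitions` times; each transmission is erased independently with
    probability `erasure_p`.  The agent always executes the last action
    it received.  The learner only records feedback once the intended
    action has actually arrived, since earlier rewards come from a
    stale action.  Returns the cumulative (pseudo-)regret.
    """
    rng = random.Random(seed)
    means = [0.4, 0.6]        # true arm means; arm 1 is optimal
    counts = [0, 0]           # feedback samples per arm
    sums = [0.0, 0.0]         # reward sums per arm
    last_received = 0         # agent's current (default) action
    regret = 0.0
    t = 0
    while t < horizon:
        # Learner's UCB choice (pull each arm once first).
        if 0 in counts:
            intended = counts.index(0)
        else:
            ucb = [sums[a] / counts[a]
                   + math.sqrt(2.0 * math.log(t + 1) / counts[a])
                   for a in (0, 1)]
            intended = ucb.index(max(ucb))
        # Repetition protocol: send the same action `repetitions` times.
        for _ in range(repetitions):
            if t >= horizon:
                break
            if rng.random() > erasure_p:   # transmission got through
                last_received = intended
            reward = 1.0 if rng.random() < means[last_received] else 0.0
            if last_received == intended:  # feedback matches intent
                counts[intended] += 1
                sums[intended] += reward
            regret += means[1] - means[last_received]
            t += 1
    return regret
```

Running it with `repetitions=1` versus a larger repetition count illustrates the trade-off the abstract alludes to: repetitions make it more likely the intended action is executed (so feedback is trustworthy), at the cost of committing to each choice for several rounds.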