We consider a novel multi-armed bandit (MAB) setup in which a learner must communicate actions to distributed agents over erasure channels, while the rewards for the actions are directly available to the learner through external sensors. In our model, the distributed agents know when an action is erased, but the central learner does not (there is no feedback), and thus the learner cannot tell whether an observed reward resulted from the intended action. We propose a scheme that can operate on top of any (existing or future) MAB algorithm and make it robust to action erasures. Our scheme yields a worst-case regret over action-erasure channels that is at most a factor of $O(1/\sqrt{1-\epsilon})$ away from the no-erasure worst-case regret of the underlying MAB algorithm, where $\epsilon$ is the erasure probability. We also propose a modification of the successive arm elimination algorithm with worst-case regret $\tilde{O}(\sqrt{KT}+K/(1-\epsilon))$, where $K$ is the number of arms and $T$ the horizon, which we show is order-optimal by providing a matching lower bound.
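To make the interaction model concrete, below is a minimal Python sketch of the action-erasure channel described above: each transmitted arm index is independently erased with probability $\epsilon$, and the agent, who observes erasures, keeps playing its last successfully received arm (an assumed fallback convention; the abstract does not specify the agent's behavior on erasure). The repetition wrapper `run_with_repeats` and the uniform arm choice are hypothetical stand-ins for illustration, not the paper's scheme.

```python
import numpy as np

rng = np.random.default_rng(0)


def erased(eps):
    """One channel use: True if the transmitted action is erased."""
    return rng.random() < eps


def run_with_repeats(arm_means, eps, T, repeats):
    """Hypothetical repetition wrapper around an arbitrary arm choice.

    The learner resends each intended arm `repeats` times; the agent
    (who sees erasures) keeps playing its last successfully received
    arm whenever a transmission is erased. The learner observes every
    reward via external sensors but never learns which arm produced it.
    """
    K = len(arm_means)
    agent_arm = 0          # assumed initial arm before any reception
    total_reward = 0.0
    t = 0
    while t < T:
        intended = int(rng.integers(K))  # stand-in for any MAB algorithm
        for _ in range(min(repeats, T - t)):
            if not erased(eps):
                agent_arm = intended     # delivered: agent switches arms
            # Bernoulli reward of the arm the agent actually played
            total_reward += float(rng.random() < arm_means[agent_arm])
            t += 1
    return total_reward


# Erasures (eps = 0.3) silently change which arm generates each reward.
print(run_with_repeats(np.array([0.3, 0.5, 0.7]), eps=0.3, T=10_000, repeats=3))
```

Note how the mismatch arises: the learner attributes every observed reward to `intended`, but after an erasure the reward actually comes from the agent's stale arm, which is what a robustification scheme must account for.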