We study incentivized exploration for the multi-armed bandit (MAB) problem with non-stationary reward distributions, where players receive compensation for exploring arms other than the greedy choice and may provide biased feedback on the reward. We consider two non-stationary environments, abruptly changing and continuously changing, and propose an incentivized exploration algorithm for each. We show that the proposed algorithms achieve regret and compensation that grow sublinearly over time, thus effectively incentivizing exploration despite the non-stationarity and the biased or drifted feedback.
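To make the setting concrete, the following is a minimal sketch of incentivized exploration in an abruptly changing environment. It is not the algorithm proposed in the paper: it assumes a sliding-window UCB rule for the principal's recommendation, Bernoulli rewards, and a compensation equal to the gap in windowed empirical means between the player's greedy arm and the recommended arm. It also omits biased feedback entirely. The function name `run_incentivized_sw_ucb` and the parameters `means_by_phase`, `phase_len`, and `window` are hypothetical names chosen for illustration.

```python
import math
import random
from collections import deque


def run_incentivized_sw_ucb(means_by_phase, phase_len, window=200, seed=0):
    """Simulate a principal who recommends arms with sliding-window UCB.

    means_by_phase[p][a] is the Bernoulli mean of arm a during phase p,
    so the environment changes abruptly every phase_len rounds.
    Returns (total_regret, total_compensation, pulls).
    """
    rng = random.Random(seed)
    n_arms = len(means_by_phase[0])
    horizon = phase_len * len(means_by_phase)
    history = deque()  # (arm, reward) pairs currently inside the window
    total_regret = 0.0
    total_compensation = 0.0
    pulls = []

    for t in range(horizon):
        means = means_by_phase[t // phase_len]

        # Windowed counts and empirical means (0.0 for arms unseen in the window).
        counts = [0] * n_arms
        sums = [0.0] * n_arms
        for arm, reward in history:
            counts[arm] += 1
            sums[arm] += reward
        emp = [sums[a] / counts[a] if counts[a] else 0.0 for a in range(n_arms)]

        # Principal's recommendation: an arm unseen in the window if any,
        # otherwise the arm with the largest UCB index.
        unseen = [a for a in range(n_arms) if counts[a] == 0]
        if unseen:
            recommended = unseen[0]
        else:
            n = len(history)
            recommended = max(
                range(n_arms),
                key=lambda a: emp[a] + math.sqrt(2.0 * math.log(n) / counts[a]),
            )

        # A myopic player prefers the arm with the highest empirical mean.
        greedy = max(range(n_arms), key=lambda a: emp[a])

        # Assumed compensation rule: pay the empirical gap the player gives up.
        # This is zero whenever the recommendation is already the greedy arm.
        total_compensation += emp[greedy] - emp[recommended]

        # The player pulls the recommended arm and reports the true reward
        # (no feedback bias in this sketch).
        reward = 1.0 if rng.random() < means[recommended] else 0.0
        total_regret += max(means) - means[recommended]
        pulls.append(recommended)

        history.append((recommended, reward))
        if len(history) > window:
            history.popleft()

    return total_regret, total_compensation, pulls


if __name__ == "__main__":
    # Two arms whose means swap halfway through the run.
    regret, compensation, _ = run_incentivized_sw_ucb(
        [[0.9, 0.1], [0.1, 0.9]], phase_len=1000, window=200, seed=0
    )
    print(f"regret={regret:.1f} compensation={compensation:.1f}")
```

The sliding window is one common way to forget stale observations after a change point; a continuously changing environment would typically call for a different forgetting scheme, such as discounting, which this sketch does not cover.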