In this paper, we study the MNL-Bandit problem in a non-stationary environment and present an algorithm with worst-case dynamic regret of $\tilde{O}\left( \min \left\{ \sqrt{NTL}\;,\; N^{\frac{1}{3}}(\Delta_{\infty}^{K})^{\frac{1}{3}} T^{\frac{2}{3}} + \sqrt{NT}\right\}\right)$. Here $N$ is the number of arms, $T$ is the time horizon, $L$ is the number of switches, and $\Delta_{\infty}^{K}$ is a variation measure of the unknown parameters. We also show that our algorithm is near-optimal (up to logarithmic factors). Our algorithm builds upon the epoch-based algorithm for the stationary MNL-Bandit problem of Agrawal et al. (2016). However, non-stationarity poses several challenges, and we introduce new techniques and ideas to address them. In particular, we give a tight characterization of the bias that non-stationarity introduces in the estimators, and we derive new concentration bounds.
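To get a feel for how the two terms inside the $\min$ trade off, the following minimal sketch evaluates the bound numerically. It is only an illustration: constants and logarithmic factors hidden by $\tilde{O}$ are dropped, and the function name `regret_bound_order` and its parameter names are hypothetical, not taken from the paper.

```python
import math


def regret_bound_order(n_arms, horizon, n_switches, variation):
    """Order-of-magnitude value of the dynamic regret bound.

    Assumption: constants and log factors hidden by the O-tilde
    notation are ignored, so only relative comparisons are meaningful.
    """
    # Switching-based term: sqrt(N * T * L)
    switch_term = math.sqrt(n_arms * horizon * n_switches)
    # Variation-based term: N^(1/3) * Delta^(1/3) * T^(2/3) + sqrt(N * T)
    variation_term = (
        (n_arms * variation) ** (1 / 3) * horizon ** (2 / 3)
        + math.sqrt(n_arms * horizon)
    )
    return min(switch_term, variation_term)


if __name__ == "__main__":
    # Many switches but small total variation: the variation term is smaller.
    print(regret_bound_order(n_arms=8, horizon=1000, n_switches=100, variation=1.0))
    # A single stationary segment with no variation: both terms equal sqrt(N * T).
    print(regret_bound_order(n_arms=4, horizon=10000, n_switches=1, variation=0.0))
```

Under these assumptions, the bound follows whichever description of the non-stationarity is more favorable: a few abrupt switches favor the $\sqrt{NTL}$ term, while frequent but small drifts favor the variation-based term.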