In this paper, we study the MNL-Bandit problem in a non-stationary environment and present an algorithm with a worst-case expected regret of $\tilde{O}\left( \min \left\{ \sqrt{NTL}\;,\; N^{\frac{1}{3}}(\Delta_{\infty}^{K})^{\frac{1}{3}} T^{\frac{2}{3}} + \sqrt{NT}\right\}\right)$. Here $N$ is the number of arms, $L$ is the number of changes, and $\Delta_{\infty}^{K}$ is a variation measure of the unknown parameters. Furthermore, we show matching lower bounds on the expected regret (up to logarithmic factors), implying that our algorithm is optimal. Our approach builds upon the epoch-based algorithm for the stationary MNL-Bandit problem in Agrawal et al. (2016). However, non-stationarity poses several challenges, and we introduce new techniques and ideas to address them. In particular, we give a tight characterization of the bias that non-stationarity introduces in the estimators, and we derive new concentration bounds.
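To make the two regimes of the bound concrete, the following minimal sketch evaluates the expression inside the $\tilde{O}(\cdot)$ for given problem sizes. It drops all constants and logarithmic factors, so it indicates only the order of growth and which term attains the minimum. The function name `regret_bound_order` and the argument `delta` (standing in for $\Delta_{\infty}^{K}$) are illustrative assumptions, not notation from the paper.

```python
import math


def regret_bound_order(N, T, L, delta):
    """Order of the regret bound, ignoring constants and log factors.

    N: number of arms, T: horizon, L: number of changes,
    delta: stand-in for the variation measure (hypothetical name).
    """
    # Term governed by the number of changes: sqrt(N * T * L).
    switching_term = math.sqrt(N * T * L)
    # Term governed by total variation: N^(1/3) * delta^(1/3) * T^(2/3) + sqrt(N * T).
    variation_term = (N * delta) ** (1 / 3) * T ** (2 / 3) + math.sqrt(N * T)
    return min(switching_term, variation_term)
```

With few changes but large variation, the $\sqrt{NTL}$ term is the smaller one. With many small changes, the variation term takes over.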