We consider the adversarial linear contextual bandit setting, which allows for the loss functions associated with each of $K$ arms to change over time without restriction. Assuming the $d$-dimensional contexts are drawn from a fixed known distribution, the worst-case expected regret over the course of $T$ rounds is known to scale as $\tilde O(\sqrt{KdT})$. Under the additional assumption that the density of the contexts is log-concave, we obtain a second-order bound of order $\tilde O(K\sqrt{d V_T})$ in terms of the cumulative second moment of the learner's losses $V_T$, and a closely related first-order bound of order $\tilde O(K\sqrt{d L_T^*})$ in terms of the cumulative loss of the best policy $L_T^*$. Since $V_T$ or $L_T^*$ may be significantly smaller than $T$, these bounds improve on the worst-case regret whenever the environment is relatively benign. Our results are obtained using a truncated version of the continuous exponential weights algorithm over the probability simplex, which we analyse by exploiting a novel connection to the linear bandit setting without contexts.
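As a rough illustration of when the data-dependent rates are actually smaller than the worst-case rate, one can compare the two orders directly. This is only a sketch: it ignores constants and the logarithmic factors hidden in $\tilde O(\cdot)$, which is a simplifying assumption made here and not a statement taken from the analysis.

$$
K\sqrt{d V_T} \;\le\; \sqrt{K d T}
\quad\Longleftrightarrow\quad
K^2 d V_T \;\le\; K d T
\quad\Longleftrightarrow\quad
V_T \;\le\; \frac{T}{K}.
$$

The same calculation with $L_T^*$ in place of $V_T$ gives the condition $L_T^* \le T/K$. Under this simplification, the extra factor of $\sqrt{K}$ in the data-dependent bounds is offset once $V_T$ or $L_T^*$ is smaller than $T$ by a factor of $K$.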