Nonstationary phenomena, such as satiation effects in recommendation, are a common feature of sequential decision-making problems. While these phenomena have been mostly studied in the framework of bandits with finitely many arms, in many practically relevant cases linear bandits provide a more effective modeling choice. In this work, we introduce a general framework for the study of nonstationary linear bandits, where current rewards are influenced by the learner's past actions in a fixed-size window. In particular, our model includes stationary linear bandits as a special case. After showing that the best sequence of actions is NP-hard to compute in our model, we focus on cyclic policies and prove a regret bound for a variant of the OFUL algorithm that balances approximation and estimation errors. Our theoretical findings are supported by experiments (which also include misspecified settings) where our algorithm is seen to perform well against natural baselines.
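To make the windowed-memory idea concrete, here is a minimal sketch of a toy environment in which the reward for the current action shrinks when that action overlaps with the last few actions played. The reward formula, the function names (`windowed_reward`, `run_cyclic`), and the parameters `window` and `gamma` are illustrative assumptions, not the model defined in the paper. The sketch is noise-free and involves no learning, so it does not implement OFUL or its variant. It only shows how a cyclic policy can beat a constant one under satiation, and how setting the window to zero gives back an ordinary stationary linear bandit.

```python
from collections import deque


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


def windowed_reward(action, history, theta, gamma):
    """Toy noise-free reward (an assumption made for illustration).

    The linear reward <action, theta> is discounted by how much the
    current action overlaps with the actions stored in `history`.
    """
    overlap = sum(dot(past, action) ** 2 for past in history)
    return dot(action, theta) * (1.0 + overlap) ** (-gamma)


def run_cyclic(block, theta, window, gamma, horizon):
    """Play the actions in `block` cyclically and return the total reward."""
    history = deque(maxlen=window)  # keeps only the last `window` actions
    total = 0.0
    for t in range(horizon):
        action = block[t % len(block)]
        total += windowed_reward(action, history, theta, gamma)
        history.append(action)
    return total


theta = (1.0, 0.9)            # hypothetical unknown parameter
e1, e2 = (1.0, 0.0), (0.0, 1.0)

# With memory (window=2, gamma=1), alternating avoids satiation.
constant_mem = run_cyclic([e1], theta, window=2, gamma=1.0, horizon=100)
cyclic_mem = run_cyclic([e1, e2], theta, window=2, gamma=1.0, horizon=100)

# With window=0 the history stays empty: a stationary linear bandit.
constant_stat = run_cyclic([e1], theta, window=0, gamma=1.0, horizon=100)
cyclic_stat = run_cyclic([e1, e2], theta, window=0, gamma=1.0, horizon=100)

print(constant_mem, cyclic_mem)
print(constant_stat, cyclic_stat)
```

In this toy setting, repeating the single best action is optimal when there is no memory, while alternating between two actions collects more reward once past actions depress the current payoff. This is the kind of trade-off that motivates restricting attention to cyclic policies.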