Real-world applications of contextual bandits often exhibit non-stationarity due to seasonality, serendipity, and evolving social trends. Although several non-stationary contextual bandit learning algorithms have been proposed in the literature, they explore excessively because they do not prioritize information of enduring value, or are designed in ways that do not scale to modern applications with high-dimensional user-specific features and large action sets, or both. In this paper, we introduce a novel non-stationary contextual bandit algorithm that addresses these concerns. It combines a scalable, deep-neural-network-based architecture with a carefully designed exploration mechanism that strategically prioritizes collecting information with the most lasting value in a non-stationary environment. Through empirical evaluations on two real-world recommendation datasets that exhibit pronounced non-stationarity, we demonstrate that our approach significantly outperforms state-of-the-art baselines.