We consider stochastic non-stationary linear bandits, where the linear parameter mapping contexts to rewards changes over time. Existing algorithms in this setting localize the policy by gradually discarding or down-weighting past data, effectively shrinking the time horizon over which learning can occur. However, in many settings, historical data may still carry partial information about the reward model. We propose to leverage such data while adapting to changes, under the assumption that the reward model decomposes into stationary and non-stationary components. Based on this assumption, we introduce ISD-linUCB, an algorithm that uses past data to learn invariances in the reward model and then exploits them to improve online performance. We show both theoretically and empirically that leveraging invariance reduces the problem dimensionality, yielding significant regret improvements in fast-changing environments when sufficient historical data is available.
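To make the central assumption concrete, one minimal formalization (the notation below is illustrative, not necessarily the paper's) writes the round-$t$ parameter as a fixed component plus a drifting component confined to a low-dimensional subspace:

$$ r_t = \langle x_t, \theta_t \rangle + \eta_t, \qquad \theta_t = \theta^{\mathrm{s}} + B\, w_t, \qquad B \in \mathbb{R}^{d \times k},\; k \ll d, $$

where $x_t$ is the chosen context, $\eta_t$ is noise, $\theta^{\mathrm{s}}$ is the stationary (invariant) component, and only the $k$-dimensional coefficient $w_t$ changes over time. Under such a model, historical data can identify $\theta^{\mathrm{s}}$ and the subspace $B$ offline, so the online learner only has to track $w_t$; this is the sense in which exploiting invariance reduces the problem dimensionality.

Below is a minimal sketch of how a LinUCB-style learner could exploit such a decomposition. It assumes estimates of $\theta^{\mathrm{s}}$ (`theta_s`) and $B$ (`B`) are already available from historical data; it illustrates the general idea under the assumed model, not the actual ISD-linUCB procedure:

```python
# Illustrative sketch only: LinUCB run in the learned k-dim subspace,
# assuming the decomposition theta_t = theta_s + B @ w_t. theta_s and B
# are presumed to have been estimated offline from historical data.
import numpy as np

def subspace_linucb_select(contexts, theta_s, B, V, b, alpha=1.0):
    """Pick an arm by UCB over the k-dimensional drifting parameter.

    contexts: (n_arms, d) candidate context vectors
    theta_s:  (d,) stationary component learned offline
    B:        (d, k) basis of the non-stationary subspace, k << d
    V, b:     running k-dim ridge statistics, V: (k, k), b: (k,)
              (typical initialization: V = lam * np.eye(k), b = np.zeros(k))
    """
    V_inv = np.linalg.inv(V)
    w_hat = V_inv @ b                      # ridge estimate of drifting w_t
    Z = contexts @ B                       # project contexts to k dims
    means = contexts @ theta_s + Z @ w_hat
    # Confidence widths depend on k-dim projections, not the full d dims.
    widths = alpha * np.sqrt(np.einsum('ij,jk,ik->i', Z, V_inv, Z))
    return int(np.argmax(means + widths))

def subspace_linucb_update(x, r, theta_s, B, V, b):
    """Update k-dim statistics with observed reward r for played context x,
    after subtracting the known stationary contribution."""
    z = B.T @ x
    V += np.outer(z, z)
    b += z * (r - x @ theta_s)
    return V, b
```

Because only the $k$-dimensional statistics $(V, b)$ must be re-learned after a change (for instance by restarting or discounting them), the dimension entering the confidence widths, and hence the regret, is $k$ rather than $d$, consistent with the improvement the abstract claims for fast-changing environments.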