In real-world streaming recommender systems, user preferences often change over time (e.g., a user may have different preferences on weekdays and weekends). Existing bandit-based streaming recommendation models treat time only as a timestamp, without explicitly modeling the relationship between time variables and time-varying user preferences. As a result, these models cannot adapt quickly to dynamic scenarios. To address this issue, we propose HyperBandit, a contextual bandit approach based on a hypernetwork, which takes time features as input and dynamically adjusts the recommendation model to time-varying user preferences. Specifically, HyperBandit maintains a neural network that generates the parameters used to estimate time-varying rewards, taking into account the correlation between time features and user preferences. Using the estimated time-varying rewards, a bandit policy makes online recommendations by learning the latent item contexts. To meet the real-time requirements of streaming recommendation, we verify that the parameter matrix has a low-rank structure and use low-rank factorization for efficient training. Theoretically, we establish a sublinear regret upper bound against the best policy. Extensive experiments on real-world datasets show that HyperBandit consistently outperforms state-of-the-art baselines in terms of accumulated rewards.
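The mechanism described above can be illustrated with a minimal sketch. It is not the authors' implementation: the cyclic time encoding, the single linear layer per factor, the bilinear reward form, the UCB-style bonus, and every name in the code (`time_features`, `LowRankHypernet`, `select_item`) are assumptions made for illustration. Item vectors are fixed random placeholders here, whereas the abstract describes latent item contexts that are learned online.

```python
import numpy as np


def time_features(hour, day_of_week):
    """Hypothetical cyclic encoding of time (an assumption, not from the source)."""
    return np.array([
        np.sin(2 * np.pi * hour / 24), np.cos(2 * np.pi * hour / 24),
        np.sin(2 * np.pi * day_of_week / 7), np.cos(2 * np.pi * day_of_week / 7),
    ])


class LowRankHypernet:
    """Maps time features s to a d x d matrix W(s) = U(s) @ V(s).T of rank <= rank.

    Generating two d x rank factors instead of a full d x d matrix is what
    keeps the number of generated parameters small.
    """

    def __init__(self, time_dim, d, rank, seed=0):
        rng = np.random.default_rng(seed)
        self.d, self.rank = d, rank
        # One linear layer per factor; a real hypernetwork would likely be deeper
        # and trained from observed rewards (training is omitted in this sketch).
        self.A_u = rng.normal(scale=0.1, size=(d * rank, time_dim))
        self.A_v = rng.normal(scale=0.1, size=(d * rank, time_dim))

    def generate(self, s):
        U = (self.A_u @ s).reshape(self.d, self.rank)
        V = (self.A_v @ s).reshape(self.d, self.rank)
        return U @ V.T


def select_item(user_vec, item_vecs, W, A_inv, alpha=0.5):
    """UCB-style choice: estimated reward user^T W item plus an exploration bonus."""
    theta = W.T @ user_vec                     # time-adjusted preference, shape (d,)
    estimate = item_vecs @ theta               # one estimated reward per item
    # Bonus sqrt(v^T A_inv v) for each item vector v, as in LinUCB-style policies.
    bonus = alpha * np.sqrt(np.einsum("id,de,ie->i", item_vecs, A_inv, item_vecs))
    return int(np.argmax(estimate + bonus))


# Usage: the same user and items, scored under two different time contexts.
d, n_items = 6, 5
rng = np.random.default_rng(1)
user_vec = rng.normal(size=d)
item_vecs = rng.normal(size=(n_items, d))      # placeholder item contexts
hypernet = LowRankHypernet(time_dim=4, d=d, rank=2)

for label, (hour, day) in {"weekday": (9, 1), "weekend": (9, 6)}.items():
    W = hypernet.generate(time_features(hour, day))
    print(label, "->", select_item(user_vec, item_vecs, W, np.eye(d)))
```

Because the parameter matrix is regenerated from the time features at every step, a change in time context changes the reward estimates immediately, without waiting for the bandit statistics to drift.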