The contextual bandit has emerged as a powerful framework for formulating recommendation as a sequential decision-making process, where each item is regarded as an arm and the objective is to minimize the cumulative regret over $T$ rounds. In this paper, we study a new problem, Clustering of Neural Bandits, which extends prior work to arbitrary reward functions in order to strike a balance between user heterogeneity and user correlations in recommender systems. To solve this problem, we propose a novel algorithm, M-CNB, which uses a meta-learner to represent and rapidly adapt to dynamic clusters, together with an informative Upper Confidence Bound (UCB)-based exploration strategy. We provide an instance-dependent performance guarantee for the proposed algorithm that holds under adversarial contexts, and we further prove that this guarantee is at least as good as those of state-of-the-art (SOTA) approaches under the same assumptions. In extensive experiments on both recommendation and online classification tasks, M-CNB outperforms SOTA baselines, demonstrating the effectiveness of the proposed approach for improving online recommendation and online classification performance.
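To make the contextual-bandit framing concrete, the following is a minimal, illustrative sketch of a standard linear UCB bandit loop (items as arms, a per-round UCB score, regret-driven updates). This is a generic LinUCB-style baseline, not the paper's M-CNB algorithm; the class name, `alpha` exploration parameter, and the linear-reward assumption are all illustrative choices, whereas M-CNB handles arbitrary reward functions with a meta-learner.

```python
import numpy as np

class LinUCB:
    """Illustrative linear UCB contextual bandit (not the paper's M-CNB)."""

    def __init__(self, n_arms, dim, alpha=1.0):
        self.alpha = alpha  # exploration strength (hypothetical default)
        # Per-arm ridge-regression statistics: A = I + sum x x^T, b = sum r x.
        self.A = [np.eye(dim) for _ in range(n_arms)]
        self.b = [np.zeros(dim) for _ in range(n_arms)]

    def select(self, contexts):
        """Pick the arm maximizing estimated reward plus a UCB exploration bonus."""
        scores = []
        for a, x in enumerate(contexts):
            A_inv = np.linalg.inv(self.A[a])
            theta = A_inv @ self.b[a]                     # reward-weight estimate
            bonus = self.alpha * np.sqrt(x @ A_inv @ x)   # confidence-width bonus
            scores.append(theta @ x + bonus)
        return int(np.argmax(scores))

    def update(self, arm, x, reward):
        """Fold the observed reward for the pulled arm into its statistics."""
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x

# Simulated interaction loop over T rounds with noiseless linear rewards.
rng = np.random.default_rng(0)
bandit = LinUCB(n_arms=2, dim=3, alpha=0.5)
true_theta = [np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])]
for _ in range(200):
    contexts = [rng.normal(size=3) for _ in range(2)]
    arm = bandit.select(contexts)
    reward = float(true_theta[arm] @ contexts[arm])
    bandit.update(arm, contexts[arm], reward)
```

Replacing the per-arm linear model with a shared neural reward estimator (and grouping users into dynamically maintained clusters) is the kind of extension the abstract describes; the UCB score structure of "point estimate plus confidence bonus" carries over.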