We consider the contextual combinatorial bandit setting in which, in each round, the learning agent (e.g., a recommender system) selects a subset of "arms" (e.g., products) and observes two kinds of reward: the rewards of the individual base arms, which are functions of known features (called "context"), and the reward of the super arm (the selected subset of arms), which is a function of the base arm rewards. The agent's goal is to simultaneously learn the unknown reward functions and choose the highest-reward arms. For example, the "reward" may represent a user's probability of clicking on one of the recommended products. Conventional bandit models, however, impose restrictive assumptions on the form of the reward functions in order to obtain performance guarantees. We instead use deep neural networks to estimate and learn the unknown reward functions and propose Neural UCB Clustering (NeUClust), which adopts a clustering approach to select the super arm in every round by exploiting underlying structure in the context space. Unlike prior neural bandit works, NeUClust uses a neural network to estimate the super arm reward and select the super arm, thus eliminating the need for a known optimization oracle. We non-trivially extend prior neural combinatorial bandit works to prove that NeUClust achieves $\widetilde{O}\left(\widetilde{d}\sqrt{T}\right)$ regret, where $\widetilde{d}$ is the effective dimension of a neural tangent kernel matrix and $T$ is the number of rounds. Experiments on real-world recommendation datasets show that NeUClust achieves lower regret and higher reward than other contextual combinatorial and neural bandit algorithms.
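The abstract does not spell out the selection rule, so the following is only a rough sketch of what a clustering-based, UCB-style super arm selection could look like. It is not the NeUClust algorithm. It assumes that each base arm already has a mean reward estimate and an exploration bonus (in a neural bandit these would come from the network, but here `mean_est` and `bonus` are hypothetical inputs), groups the arm contexts with plain k-means, and keeps the arm with the highest upper confidence bound in each cluster. The function names `kmeans` and `select_super_arm` are illustrative and do not come from the paper.

```python
import random


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns one cluster label per point."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        labels = [
            min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])),
            )
            for p in points
        ]
        # Update step: move each non-empty cluster's center to its mean.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def select_super_arm(contexts, mean_est, bonus, k):
    """Assumed rule: keep the highest-UCB base arm in each context cluster."""
    labels = kmeans(contexts, k)
    ucb = [m + b for m, b in zip(mean_est, bonus)]  # optimistic score per arm
    best = {}  # cluster label -> index of its best arm so far
    for arm, label in enumerate(labels):
        if label not in best or ucb[arm] > ucb[best[label]]:
            best[label] = arm
    # Empty clusters contribute no arm, so fewer than k arms may be returned.
    return sorted(best.values())


if __name__ == "__main__":
    # Toy data: two well-separated groups of arm contexts.
    contexts = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
    mean_est = [0.2, 0.5, 0.9, 0.1]  # hypothetical reward estimates
    bonus = [0.1, 0.1, 0.0, 0.3]     # hypothetical exploration bonuses
    print(select_super_arm(contexts, mean_est, bonus, k=2))
```

Picking one arm per cluster is a design choice made here to keep the recommended set diverse across the context space. The abstract states that NeUClust uses a neural network to estimate the super arm reward when selecting the super arm; this sketch does not model that step.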