We propose a novel master-slave architecture for the top-$K$ combinatorial multi-armed bandit problem with non-linear bandit feedback and diversity constraints. To the best of our knowledge, this is the first combinatorial bandit setting that considers diversity constraints under bandit feedback. Specifically, to explore the combinatorial and constrained action space efficiently, we introduce six slave models with distinct merits that generate diversified samples balancing rewards, constraints, and efficiency. Moreover, we propose teacher-learning-based optimization and a policy co-training technique to boost the performance of the slave models. The master model then collects the elite samples provided by the slave models and selects the sample rated best by a neural contextual UCB-based network, thereby making a decision that trades off exploration and exploitation. Thanks to the elaborate design of the slave models, the co-training mechanism among them, and the novel interactions between the master and slave models, our approach significantly surpasses existing state-of-the-art algorithms on both synthetic and real datasets for recommendation tasks. The code is available at: \url{https://github.com/huanghanchi/Master-slave-Algorithm-for-Top-K-Bandits}.
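To make the master's selection step concrete, the following is a minimal sketch, not the implementation from the repository. It assumes that the slave models have already proposed candidate top-$K$ sets, and it replaces the neural contextual UCB network with a plain count-based UCB1-style score. The names `satisfies_diversity`, `ucb_score`, and `master_select`, the "minimum number of distinct categories" form of the diversity constraint, and the exploration constant `c` are all hypothetical choices made for illustration.

```python
import math
from typing import Dict, List, Sequence, Tuple

Candidate = Tuple[int, ...]  # a top-K set of item indices, stored sorted


def satisfies_diversity(candidate: Candidate, category: Sequence[int],
                        min_categories: int) -> bool:
    """Assumed constraint: the set must cover at least `min_categories` categories."""
    return len({category[i] for i in candidate}) >= min_categories


def ucb_score(mean: float, pulls: int, t: int, c: float = 1.0) -> float:
    """UCB1-style score: empirical mean plus an exploration bonus.

    This is a stand-in for the neural contextual UCB estimate, not the same thing.
    """
    if pulls == 0:
        return math.inf  # candidates never played are tried first
    return mean + c * math.sqrt(2.0 * math.log(t) / pulls)


def master_select(candidates: List[Candidate],
                  stats: Dict[Candidate, Tuple[float, int]],
                  category: Sequence[int],
                  min_categories: int,
                  t: int) -> Candidate:
    """Pick the feasible candidate with the highest UCB score.

    `stats` maps a candidate to (empirical mean reward, number of plays).
    """
    feasible = [tuple(sorted(cand)) for cand in candidates
                if satisfies_diversity(cand, category, min_categories)]
    if not feasible:
        raise ValueError("no candidate satisfies the diversity constraint")

    def score(cand: Candidate) -> float:
        mean, pulls = stats.get(cand, (0.0, 0))
        return ucb_score(mean, pulls, t)

    return max(feasible, key=score)


# Toy usage: five items in three categories, K = 2, at least two categories.
category = [0, 0, 1, 1, 2]
candidates = [(0, 1), (0, 2), (2, 4)]  # imagined proposals from the slave models
stats = {(0, 1): (0.9, 10), (0, 2): (0.5, 10), (2, 4): (0.4, 1)}
print(master_select(candidates, stats, category, min_categories=2, t=11))  # → (2, 4)
```

In this toy run, `(0, 1)` has the highest empirical mean but is discarded because both items share a category, and `(2, 4)` wins over `(0, 2)` because its single play leaves it with a much larger exploration bonus.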