We consider a contextual combinatorial bandit problem in which, in each round, a learning agent selects a subset of arms and receives feedback on the selected arms according to their scores. The score of an arm is an unknown function of the arm's feature. Approximating this unknown score function with deep neural networks, we propose two algorithms: Combinatorial Neural UCB ($\texttt{CN-UCB}$) and Combinatorial Neural Thompson Sampling ($\texttt{CN-TS}$). We prove that $\texttt{CN-UCB}$ achieves $\tilde{\mathcal{O}}(\tilde{d} \sqrt{T})$ or $\tilde{\mathcal{O}}(\sqrt{\tilde{d} T K})$ regret, where $\tilde{d}$ is the effective dimension of a neural tangent kernel matrix, $K$ is the size of the selected subset of arms, and $T$ is the time horizon. For $\texttt{CN-TS}$, we adapt an optimistic sampling technique to ensure the optimism of the sampled combinatorial action, achieving a worst-case (frequentist) regret of $\tilde{\mathcal{O}}(\tilde{d} \sqrt{TK})$. To the best of our knowledge, these are the first combinatorial neural bandit algorithms with regret guarantees. In particular, $\texttt{CN-TS}$ is the first Thompson sampling algorithm with worst-case regret guarantees for the general contextual combinatorial bandit problem. Numerical experiments demonstrate the superior performance of the proposed algorithms.
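To make the round structure concrete (score every arm optimistically, select $K$ arms, observe feedback only on the selected arms), the following is a minimal sketch under strong simplifying assumptions. It replaces the neural network with a linear ridge-regression model, so it is not $\texttt{CN-UCB}$ itself. It also assumes the best subset is simply the $K$ arms with the highest scores. The function names `select_top_k_ucb` and `ridge_update`, the exploration weight `beta`, and the simulated rewards are all hypothetical and chosen only for illustration.

```python
import numpy as np


def select_top_k_ucb(features, theta_hat, A_inv, beta, k):
    """Return indices of the k arms with the largest optimistic scores.

    Assumption: a linear score model stands in for the neural network.
    """
    # Estimated score of each arm under the current linear model.
    mean = features @ theta_hat
    # Confidence width sqrt(x^T A^{-1} x), computed per arm.
    width = np.sqrt(np.einsum("ij,jk,ik->i", features, A_inv, features))
    ucb = mean + beta * width
    # Assumption: the combinatorial action is the top-k arms by score.
    return np.argsort(-ucb)[:k]


def ridge_update(A, b, selected_features, rewards):
    """Update ridge statistics using feedback from the selected arms only."""
    A_new = A + selected_features.T @ selected_features
    b_new = b + selected_features.T @ rewards
    return A_new, b_new


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, n_arms, k = 4, 10, 3
    theta_true = rng.normal(size=d)  # hypothetical ground-truth parameter
    A, b = np.eye(d), np.zeros(d)
    for t in range(50):
        X = rng.normal(size=(n_arms, d))  # fresh contexts each round
        A_inv = np.linalg.inv(A)
        chosen = select_top_k_ucb(X, A_inv @ b, A_inv, beta=1.0, k=k)
        # Simulated noisy feedback on the chosen arms only.
        rewards = X[chosen] @ theta_true + 0.1 * rng.normal(size=k)
        A, b = ridge_update(A, b, X[chosen], rewards)
    error = np.linalg.norm(np.linalg.inv(A) @ b - theta_true)
    print("estimation error:", error)
```

In this sketch the exploration bonus shrinks for feature directions that have been observed often, so arms with poorly explored features are favored early on. A neural variant would, roughly speaking, replace the raw features with network gradients, which is where a quantity like the effective dimension $\tilde{d}$ would enter.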