We study a Bayesian $k$-armed bandit problem in the \emph{many-armed} regime, where $k \geq \sqrt{T}$ and $T$ is the time horizon. First, in line with recent literature on many-armed bandit problems, we observe that subsampling plays a key role in designing optimal algorithms: the conventional UCB algorithm is sub-optimal, whereas a subsampled UCB (SS-UCB), which selects $\Theta(\sqrt{T})$ arms and runs UCB on only those arms, is rate-optimal. However, despite its optimal regret guarantee, SS-UCB empirically underperforms a greedy algorithm that always pulls the empirically best arm. Simulations with real-world data show that this observation extends to contextual settings. Our findings suggest a new form of \emph{free exploration} that benefits greedy algorithms in the many-armed regime, fundamentally linked to a tail event concerning the prior distribution of arm rewards. This differs from the notion of free exploration recently discussed in the contextual bandit literature, which stems from covariate variation. Building on these insights, we prove that a subsampled greedy algorithm is rate-optimal for Bernoulli bandits in the many-armed regime and achieves sublinear regret for more general distributions. Overall, our results suggest that practitioners in the many-armed regime may benefit from adopting greedy algorithms.
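To make the two subsampled policies concrete, the following is a minimal sketch, not the paper's implementation. It assumes Bernoulli rewards, a subsample of $\lceil\sqrt{T}\rceil$ arms, one initial pull per subsampled arm, and the textbook UCB1 index $\hat{\mu}_j + \sqrt{2\log t / n_j}$. The function name `run_bandit` and its parameters are hypothetical and chosen only for illustration.

```python
import math
import random


def run_bandit(means, horizon, policy, rng):
    """Run a subsampled policy on Bernoulli arms and return its pseudo-regret.

    means   -- true success probability of each arm (hidden from the policy)
    horizon -- number of rounds T
    policy  -- "greedy" or "ucb" (illustrative labels, an assumption)
    rng     -- a random.Random instance, for reproducibility
    """
    # Subsample about sqrt(T) arms; the exact constant is an assumption.
    m = max(1, min(len(means), math.ceil(math.sqrt(horizon))))
    arms = rng.sample(range(len(means)), m)
    pulls = [0] * m
    wins = [0] * m
    best = max(means)  # regret is measured against the best of all k arms
    regret = 0.0
    for t in range(horizon):
        if t < m:
            i = t  # initialisation: pull each subsampled arm once
        elif policy == "greedy":
            # Greedy: pull the arm with the highest empirical mean.
            i = max(range(m), key=lambda j: wins[j] / pulls[j])
        else:
            # UCB1-style index: empirical mean plus an exploration bonus.
            i = max(
                range(m),
                key=lambda j: wins[j] / pulls[j]
                + math.sqrt(2.0 * math.log(t + 1) / pulls[j]),
            )
        reward = 1 if rng.random() < means[arms[i]] else 0
        pulls[i] += 1
        wins[i] += reward
        regret += best - means[arms[i]]
    return regret


if __name__ == "__main__":
    rng = random.Random(0)
    T = 10_000
    k = 1_000  # many-armed regime: k >= sqrt(T)
    means = [rng.random() for _ in range(k)]  # uniform prior on arm means
    for policy in ("greedy", "ucb"):
        print(policy, run_bandit(means, T, policy, rng))
```

Running the script prints one pseudo-regret value per policy for a single draw of arm means. A single run is noisy, so any comparison between the two policies should average over many independent draws from the prior.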