We study stochastic multi-armed bandits under simultaneous constraints on space and adaptivity: the learner interacts with the environment in $B$ batches and has only $W$ bits of persistent memory. Prior work shows that each constraint alone is surprisingly mild: near-minimax regret $\widetilde{O}(\sqrt{KT})$ is achievable with $O(\log T)$ bits of memory under fully adaptive interaction, and with a $K$-independent $O(\log\log T)$-type number of batches when memory is unrestricted. We show that this picture breaks down when both constraints apply at once. We prove that any algorithm with a $W$-bit memory constraint must use at least $\Omega(K/W)$ batches to achieve near-minimax regret $\widetilde{O}(\sqrt{KT})$, even under adaptive grids. In particular, logarithmic memory rules out $K$-independent batch complexity. Our proof is based on an information bottleneck. We show that, under a suitable hard prior, near-minimax regret forces the learner to acquire $\Omega(K)$ bits of information about the hidden set of good arms, whereas an algorithm with $B$ batches and $W$ bits of memory can acquire only $O(BW)$ bits. A key ingredient is a localized change-of-measure lemma that yields probability-level arm exploration guarantees and may be of independent interest. We also give an algorithm using $O(\log T)$ bits of memory and $\widetilde{O}(K)$ batches that achieves regret $\widetilde{O}(\sqrt{KT})$, nearly matching our lower bound.
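As a rough sketch of how the two information bounds combine into the batch lower bound, write $I$ for the information the learner acquires about the hidden set of good arms, and let $c, C > 0$ be the constants hidden in the $\Omega(K)$ and $O(BW)$ bounds. The symbols $I$, $c$, and $C$ are notation assumed here for illustration only and are not taken from the proof itself.

$$
cK \;\le\; I \;\le\; C\,BW
\quad\Longrightarrow\quad
B \;\ge\; \frac{c}{C}\cdot\frac{K}{W} \;=\; \Omega\!\left(\frac{K}{W}\right).
$$

For example, with $W = O(\log T)$ this gives $B = \Omega(K/\log T)$, which the $\widetilde{O}(K)$-batch algorithm matches up to polylogarithmic factors.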