Few Batches or Little Memory, But Not Both: Simultaneous Space and Adaptivity Constraints in Stochastic Bandits

We study stochastic multi-armed bandits under simultaneous constraints on space and adaptivity: the learner interacts with the environment in $B$ batches and has only $W$ bits of persistent memory. Prior work shows that each constraint alone is surprisingly mild. Near-minimax regret $\widetilde{O}(\sqrt{KT})$ is achievable with $O(\log T)$ bits of memory under fully adaptive interaction. It is also achievable with a $K$-independent number of batches, of order $O(\log\log T)$, when memory is unrestricted. We show that this picture breaks down when both constraints hold at once. We prove that any algorithm with a $W$-bit memory constraint must use at least $\Omega(K/W)$ batches to achieve near-minimax regret $\widetilde{O}(\sqrt{KT})$, even under adaptive grids. In particular, logarithmic memory rules out $O(K^{1-\varepsilon})$ batch complexity for any constant $\varepsilon > 0$. Our proof rests on an information bottleneck. Under a suitable hard prior, near-minimax regret forces the learner to acquire $\Omega(K)$ bits of information about the hidden set of good arms. By contrast, an algorithm with $B$ batches and $W$ bits of memory can acquire only $O(BW)$ bits. A key ingredient is a localized change-of-measure lemma that yields probability-level guarantees on arm exploration; this lemma is of independent interest. We also give an algorithm that works for any bit budget $W$ with $\Omega(\log T) \le W \le O(K\log T)$. It uses at most $W$ bits of memory and $\widetilde{O}(K/W)$ batches while achieving regret $\widetilde{O}(\sqrt{KT})$, which matches our lower bound up to polylogarithmic factors.
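
The counting argument above can be condensed into a single chain of inequalities. This is a schematic restatement with constants suppressed, not the paper's formal proof. We write $S$ for the hidden set of good arms, and $\mathrm{Info}(S)$ is a hypothetical shorthand for "bits of information the learner acquires about $S$," not the paper's exact information measure:

$$
\Omega(K) \;\le\; \mathrm{Info}(S) \;\le\; O(BW) \quad\Longrightarrow\quad B \;=\; \Omega\!\left(\frac{K}{W}\right).
$$

As a worked instance, take logarithmic memory $W = \Theta(\log T)$. The lower bound becomes $B = \Omega(K/\log T)$, which exceeds any $O(K^{1-\varepsilon})$ budget whenever $\log T = o(K^{\varepsilon})$. At the two ends of the algorithm's range, the upper bound $\widetilde{O}(K/W)$ gives $\widetilde{O}(K)$ batches when $W = \Theta(\log T)$ and $\widetilde{O}(1)$ batches when $W = \Theta(K\log T)$.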
