We study high-dimensional multi-armed contextual bandits with batched feedback, where the $T$ steps of online interaction are divided into $L$ batches. Specifically, each batch collects data according to a policy that depends on previous batches, and the rewards are revealed only at the end of the batch. Such a feedback structure is common in applications such as personalized medicine and online advertising, where online data often do not arrive in a fully serial manner. We consider high-dimensional linear settings in which the reward function of the bandit model admits either a sparse or a low-rank structure, and ask how few batches are needed to achieve performance comparable to that of fully dynamic data, in which $L = T$. For these settings, we design a provably sample-efficient algorithm that achieves $\tilde{\mathcal{O}}(s_0^2 \log^2 T)$ regret in the sparse case and $\tilde{\mathcal{O}}(r^2 \log^2 T)$ regret in the low-rank case, using only $L = \mathcal{O}(\log T)$ batches. Here $s_0$ and $r$ are the sparsity and rank of the reward parameter in the sparse and low-rank cases, respectively, and $\tilde{\mathcal{O}}(\cdot)$ omits logarithmic factors involving the feature dimensions. In other words, our algorithm achieves regret bounds comparable to those in the fully sequential setting with only $\mathcal{O}(\log T)$ batches. Our algorithm features a novel batch allocation method that adjusts the batch sizes according to the estimation accuracy within each batch and the cumulative regret. We also conduct experiments with synthetic and real-world data to validate our theory.
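To make the batched feedback protocol concrete, the following is a minimal Python sketch of an interaction loop in which the policy is fixed within a batch and rewards reach the learner only at the batch end. It is an illustration under stated assumptions, not the allocation method of this work: the batch grid here is a fixed doubling schedule, swapped in because it is the simplest grid with $\mathcal{O}(\log T)$ batches, whereas the method described above adapts batch sizes to estimation accuracy and cumulative regret. The names `doubling_batch_ends`, `run_batched`, `choose_action`, `get_reward`, and `update_policy` are hypothetical.

```python
def doubling_batch_ends(T, first_size=1):
    """End indices (1-based, inclusive) of batches whose sizes double.

    Assumption: a fixed doubling grid, used only for illustration.
    The last batch is truncated so that the grid ends exactly at T.
    """
    if T < 1 or first_size < 1:
        raise ValueError("T and first_size must be positive integers")
    ends, size, t = [], first_size, 0
    while t < T:
        t = min(T, t + size)
        ends.append(t)
        size *= 2
    return ends


def run_batched(T, choose_action, get_reward, update_policy, state=None):
    """Hypothetical batched interaction loop.

    Within a batch the learner's state (and hence its policy) is frozen;
    observed (step, action, reward) triples are buffered and handed to
    update_policy only once the batch has ended.
    """
    start = 0
    for end in doubling_batch_ends(T):
        buffer = []
        for t in range(start, end):
            action = choose_action(state, t)  # uses pre-batch state only
            buffer.append((t, action, get_reward(t, action)))
        state = update_policy(state, buffer)  # rewards revealed here
        start = end
    return state
```

With `first_size=1`, a horizon of $T = 1000$ yields 10 batches, so the policy is updated 10 times rather than 1000.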