We study high-dimensional multi-armed contextual bandits with batched feedback, where the $T$ steps of online interaction are divided into $L$ batches. Specifically, each batch collects data according to a policy that depends on the previous batches, and the rewards are revealed only at the end of the batch. Such a feedback structure is common in applications such as personalized medicine and online advertising, where the online data often do not arrive in a fully serial manner. We consider high-dimensional linear settings in which the reward function of the bandit model admits either a sparse or a low-rank structure, and ask how few batches are needed to achieve performance comparable to that of the fully dynamic setting, in which $L = T$. For these settings, we design a provably sample-efficient algorithm that achieves $\tilde{\mathcal{O}}(s_0^2 \log^2 T)$ regret in the sparse case and $\tilde{\mathcal{O}}(r^2 \log^2 T)$ regret in the low-rank case, using only $L = \mathcal{O}(\log T)$ batches. Here $s_0$ and $r$ are the sparsity and the rank of the reward parameter in the sparse and low-rank cases, respectively, and $\tilde{\mathcal{O}}(\cdot)$ omits logarithmic factors involving the feature dimensions. In other words, our algorithm achieves regret bounds comparable to those in the fully sequential setting with only $\mathcal{O}(\log T)$ batches. Our algorithm features a novel batch allocation method that adjusts the batch sizes according to the estimation accuracy within each batch and the cumulative regret. We also conduct experiments on synthetic and real-world data to validate our theory.
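To make the $L = \mathcal{O}(\log T)$ batch count concrete, the following is a minimal sketch of how a horizon of $T$ steps can be covered by logarithmically many batches. It uses a fixed doubling grid, which is an assumption made purely for illustration: the allocation method described above adapts the batch sizes to the estimation accuracy and the cumulative regret, and that rule is not reproduced here. The function name `doubling_batch_grid` is hypothetical.

```python
def doubling_batch_grid(T: int) -> list[int]:
    """Return batch end points t_1 < t_2 < ... < t_L = T.

    Illustrative assumption: batch sizes double (1, 2, 4, ...), with the
    last batch truncated at T. This is NOT the adaptive allocation rule
    from the abstract; it only shows that O(log T) batches can cover T steps.
    """
    if T < 1:
        raise ValueError("T must be a positive integer")
    ends = []
    size, t = 1, 0
    while t < T:
        t = min(t + size, T)  # rewards for this batch are revealed at step t
        ends.append(t)
        size *= 2             # next batch is twice as long
    return ends


if __name__ == "__main__":
    grid = doubling_batch_grid(1000)
    print(len(grid), grid)
```

Under this doubling assumption the first $k$ batches cover $2^k - 1$ steps, so the number of batches is $\lceil \log_2(T + 1) \rceil$; for $T = 1000$ the grid has 10 batches, compared with $L = T = 1000$ policy updates in the fully sequential setting.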