Learning paradigms based purely on offline data and those based solely on sequential online learning have both been well studied in the literature. In this paper, we consider combining offline data with online learning, an area that is less studied but of obvious practical importance. We consider the stochastic $K$-armed bandit problem, where the goal is to identify, with confidence $1-\delta$, the arm with the highest mean in the presence of relevant offline data. We derive a lower bound on the sample complexity of policies that provide such $1-\delta$ probabilistic correctness guarantees, and we develop algorithms that match this lower bound when $\delta$ is small. Our algorithms are computationally efficient, with an average per-sample acquisition cost of $\tilde{O}(K)$, and rely on a careful characterization of the optimality conditions of the lower bound problem.
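For concreteness, the $1-\delta$ correctness requirement can be sketched in the form that is standard in the fixed-confidence best-arm identification literature. The notation here is an assumption and not necessarily the paper's own: $\mu_a$ denotes the mean of arm $a$, $a^\star$ the (assumed unique) best arm, $\tau$ the stopping time counting online samples, and $\hat{a}_\tau$ the arm recommended at stopping.

$$
a^\star = \arg\max_{a \in \{1,\dots,K\}} \mu_a,
\qquad
\mathbb{P}\bigl(\tau < \infty,\ \hat{a}_\tau \neq a^\star\bigr) \le \delta .
$$

Under this reading, sample complexity would refer to $\mathbb{E}[\tau]$, the expected number of online samples drawn before stopping, with the offline data available before the first online sample is taken.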