We investigate the fundamental problem of leveraging offline data to accelerate online reinforcement learning, a direction with strong potential but limited theoretical grounding. Our study centers on how to \emph{learn} and \emph{apply} value envelopes in this setting. To this end, we introduce a principled two-stage framework: the first stage uses offline data to derive upper and lower bounds on value functions, and the second incorporates these learned bounds into online algorithms. Our method extends prior work by decoupling the upper and lower bounds, which allows more flexible and tighter approximations. In contrast to approaches that rely on fixed shaping functions, our envelopes are data-driven and explicitly modeled as random variables, with a filtration argument ensuring independence across the two phases. The analysis establishes high-probability regret bounds determined by two interpretable quantities, thereby providing a formal bridge between offline pre-training and online fine-tuning. Empirical results on tabular MDPs demonstrate substantial regret reductions compared with both UCBVI and prior methods, while remaining competitive with related approaches.
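The abstract does not specify how the learned bounds enter the online algorithm. As a rough illustration only, the sketch below assumes one simple possibility: a UCBVI-style optimistic backward induction on a finite-horizon tabular MDP, with each stage's value estimate clipped to an envelope obtained offline. The function name `envelope_clipped_planning` and the array layout (`P_hat`, `R_hat`, `bonus`, `V_lo`, `V_hi`) are hypothetical and are not taken from the paper.

```python
import numpy as np


def envelope_clipped_planning(P_hat, R_hat, bonus, V_lo, V_hi):
    """Optimistic backward induction with values clipped to an envelope.

    Assumed (hypothetical) layout for a tabular MDP with horizon H,
    S states and A actions:
      P_hat : (H, S, A, S) estimated transition probabilities
      R_hat : (H, S, A)    estimated mean rewards
      bonus : (H, S, A)    exploration bonuses
      V_lo  : (H + 1, S)   lower envelope from the offline stage
      V_hi  : (H + 1, S)   upper envelope from the offline stage
    Returns the optimistic Q of shape (H, S, A) and V of shape (H + 1, S).
    """
    H, S, A = R_hat.shape
    V = np.zeros((H + 1, S))  # terminal values V[H] stay at zero
    Q = np.zeros((H, S, A))
    for h in range(H - 1, -1, -1):
        # Optimistic Bellman backup: reward + bonus + expected next value.
        Q[h] = R_hat[h] + bonus[h] + P_hat[h] @ V[h + 1]
        # Assumption: the envelope is applied by clipping the greedy value.
        V[h] = np.clip(Q[h].max(axis=1), V_lo[h], V_hi[h])
    return Q, V
```

Under this assumption, a tight upper envelope caps the optimism that the bonus would otherwise propagate through the backup, which is one intuitive route by which offline information could reduce online exploration. Setting `V_lo` to zero and `V_hi` to infinity gives back unclipped optimistic planning.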