Reliable uncertainty quantification is crucial for reinforcement learning (RL) in high-stakes settings. We propose a unified conformal prediction framework for infinite-horizon policy evaluation that constructs distribution-free prediction intervals for returns in both on-policy and off-policy settings. Our method integrates distributional RL with conformal calibration, addressing challenges such as unobserved returns, temporal dependence, and distributional shift. We introduce a modular pseudo-return construction based on truncated rollouts and a time-aware calibration strategy using experience replay and weighted subsampling. These innovations mitigate model bias and restore approximate exchangeability, enabling uncertainty quantification even under policy shifts. Our theoretical analysis provides coverage guarantees that account for model misspecification and errors in importance weight estimation. Empirical results on synthetic and benchmark environments, including Mountain Car, show that our method significantly improves coverage and reliability over standard distributional RL baselines.
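To make the calibration step concrete, the following is a minimal sketch under assumptions not taken from the paper: `pseudo_return` combines a truncated rollout with a model-based tail estimate, and `weighted_conformal_quantile` applies a simplified weighted split-conformal threshold, with recency weights standing in for the time-aware calibration; the function names, the randomly generated value predictions, and all constants are illustrative placeholders rather than the paper's method.

```python
# Illustrative sketch only (not the paper's implementation): pseudo-returns from
# truncated rollouts plus a weighted split-conformal quantile. All names and
# numbers here are hypothetical.
import numpy as np

def pseudo_return(rewards, tail_estimate, gamma=0.99):
    """Discounted sum of observed rewards over a truncated horizon, plus a
    discounted model-based estimate of the return beyond the truncation point."""
    horizon = len(rewards)
    observed = sum(gamma ** t * r for t, r in enumerate(rewards))
    return observed + gamma ** horizon * tail_estimate

def weighted_conformal_quantile(scores, weights, alpha=0.1):
    """Approximate (1 - alpha) quantile of calibration scores under normalized
    weights; weights may encode recency or importance ratios under policy shift.
    Simplified: the held-out test point gets unit weight, and the largest score
    is returned if the target level exceeds all cumulative weights."""
    scores, weights = np.asarray(scores, float), np.asarray(weights, float)
    order = np.argsort(scores)
    scores, weights = scores[order], weights[order]
    cum = np.cumsum(weights) / (weights.sum() + 1.0)
    idx = np.searchsorted(cum, 1.0 - alpha)
    return scores[min(idx, len(scores) - 1)]

rng = np.random.default_rng(0)
n_cal = 200
# Stand-ins for a distributional value model's predictions and for pseudo-returns
# built from 20-step truncated rollouts with a model-based tail estimate.
predicted = rng.normal(10.0, 1.0, size=n_cal)
pseudo = np.array([
    pseudo_return(rng.normal(0.5, 0.1, size=20), tail_estimate=rng.normal(1.0, 0.2))
    for _ in range(n_cal)
])
scores = np.abs(pseudo - predicted)        # nonconformity scores on calibration data
weights = 0.95 ** np.arange(n_cal)[::-1]   # recency weighting of replayed episodes

q = weighted_conformal_quantile(scores, weights, alpha=0.1)
test_prediction = 10.2
print(f"90% prediction interval for the return: "
      f"[{test_prediction - q:.2f}, {test_prediction + q:.2f}]")
```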