Reinforcement learning with outcome-based feedback faces a fundamental challenge: when rewards are observed only at trajectory endpoints, how do we assign credit to the right actions? This paper provides the first comprehensive analysis of this problem in online RL with general function approximation. We develop a provably sample-efficient algorithm achieving $\widetilde{O}({C_{\rm cov} H^3}/{\epsilon^2})$ sample complexity, where $C_{\rm cov}$ is the coverability coefficient of the underlying MDP. By leveraging general function approximation, our approach remains effective in large or infinite state spaces where tabular methods fail, requiring only that value functions and reward functions be representable by appropriate function classes. Our results also characterize when outcome-based feedback is statistically separated from per-step reward feedback, revealing an unavoidable exponential separation for certain MDPs. For deterministic MDPs, we show how to eliminate the completeness assumption, dramatically simplifying the algorithm. We further extend our approach to preference-based feedback, proving that the same statistical efficiency can be achieved even under this more limited form of feedback. Together, these results lay a theoretical foundation for understanding the statistical properties of outcome-based reinforcement learning.
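To make the feedback model concrete, the sketch below (our own illustration, not taken from the paper) shows a toy episode in which per-step rewards exist inside the environment but the learner only receives the trajectory-level sum at the endpoint; the names `ToyChainMDP` and `rollout_outcome_feedback` are hypothetical and do not reflect the paper's algorithm.

```python
# A minimal sketch (not from the paper) contrasting per-step rewards with
# outcome-based feedback in a toy chain MDP. All names here are hypothetical
# and only illustrate the feedback model described in the abstract.

class ToyChainMDP:
    """An H-step chain MDP: action 1 moves right, action 0 stays put."""

    def __init__(self, horizon: int = 5):
        self.horizon = horizon

    def step(self, state: int, action: int):
        next_state = state + action
        # A per-step reward exists internally, but under outcome-based
        # feedback the learner never observes it step by step.
        reward = 1.0 if (action == 1 and state == self.horizon - 1) else 0.0
        return next_state, reward


def rollout_outcome_feedback(mdp: ToyChainMDP, policy):
    """Run one episode and reveal only the cumulative reward at the endpoint."""
    state, total_reward, trajectory = 0, 0.0, []
    for h in range(mdp.horizon):
        action = policy(state, h)
        trajectory.append((state, action))
        state, reward = mdp.step(state, action)
        total_reward += reward  # accumulated internally, never shown per step
    # The learner only sees (trajectory, total_reward); credit assignment means
    # inferring which (state, action) pairs actually produced the reward.
    return trajectory, total_reward


if __name__ == "__main__":
    mdp = ToyChainMDP(horizon=5)
    always_right = lambda s, h: 1
    traj, outcome = rollout_outcome_feedback(mdp, always_right)
    print(traj, outcome)  # e.g. [(0, 1), (1, 1), ...] with outcome 1.0 only
```

In the preference-based setting mentioned above, even this scalar outcome would be replaced by a comparison between two trajectories, which is the more limited feedback the abstract refers to.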