We study the statistical properties of fitted Q-evaluation (FQE), a method for estimating the value of a target policy from offline data generated by a behavior policy. We provide a theoretical analysis of FQE estimators under both parametric and nonparametric models of the $Q$-function. In particular, we address three questions about FQE that remain largely unexplored in the literature: (1) Under a nonparametric model with a fixed horizon $T$, can FQE achieve the optimal $n^{-1/2}$ convergence rate in the sample size $n$ for estimating the policy value? (2) How does the error bound depend on the horizon $T$? (3) What role does the probability ratio function play in improving the convergence of FQE estimators? We show that under a completeness assumption on the $Q$-functions, which is mild in the nonparametric setting, the policy value estimation error of both parametric and nonparametric FQE estimators attains the optimal rate in $n$. We also establish the corresponding error bounds in terms of both $n$ and $T$. Under an additional realizability assumption on the ratio functions, the error rate improves from $T^{1.5}/\sqrt{n}$ to $T/\sqrt{n}$, which matches the sharpest known bound in the literature for the tabular setting.
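For readers unfamiliar with FQE, the backward regression it performs can be sketched in the simplest (tabular) case as follows. This is a minimal illustration, not the estimator analyzed in the paper: the function name `tabular_fqe`, the data layout `transitions[t] = (s, a, r, s_next)`, and the use of a stationary target policy `pi[s, a]` are all assumptions made for the example, and the per-cell average stands in for the general parametric or nonparametric regression step.

```python
import numpy as np

def tabular_fqe(transitions, pi, n_states, n_actions, horizon):
    """Illustrative finite-horizon FQE with a tabular Q-function.

    transitions[t] = (s, a, r, s_next): integer arrays s, a, s_next and a
    float array r holding the offline samples observed at step t
    (hypothetical layout chosen for this sketch).
    pi[s, a] = probability that the target policy picks action a in state s
    (assumed stationary here for brevity).
    """
    q_next = np.zeros((n_states, n_actions))  # Q at step horizon is zero
    for t in reversed(range(horizon)):
        s, a, r, s_next = transitions[t]
        # Value of the target policy at the next step under the current fit.
        v_next = (pi * q_next).sum(axis=1)
        # Regression target: reward plus estimated next-step value.
        y = r + v_next[s_next]
        # Least squares over a tabular class is the per-(s, a) sample mean.
        sums = np.zeros((n_states, n_actions))
        counts = np.zeros((n_states, n_actions))
        np.add.at(sums, (s, a), y)
        np.add.at(counts, (s, a), 1.0)
        # Cells never visited by the behavior policy are left at zero.
        q_next = np.divide(sums, counts, out=np.zeros_like(sums), where=counts > 0)
    # Average the fitted first-step Q over observed initial states.
    s0 = transitions[0][0]
    return float((pi[s0] * q_next[s0]).sum(axis=1).mean())
```

In a nonparametric version, the per-cell average would be replaced by a regression over a richer function class, which is where assumptions such as completeness enter.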