We study stochastic approximation procedures for approximately solving a $d$-dimensional linear fixed point equation based on observing a trajectory of length $n$ from an ergodic Markov chain. We first exhibit a non-asymptotic bound of the order $t_{\mathrm{mix}} \tfrac{d}{n}$ on the squared error of the last iterate of a standard scheme, where $t_{\mathrm{mix}}$ is a mixing time. We then prove a non-asymptotic instance-dependent bound on a suitably averaged sequence of iterates, with a leading term that matches the local asymptotic minimax limit, including sharp dependence on the parameters $(d, t_{\mathrm{mix}})$ in the higher order terms. We complement these upper bounds with a non-asymptotic minimax lower bound that establishes the instance-optimality of the averaged SA estimator. We derive corollaries of these results for policy evaluation with Markov noise -- covering the TD($\lambda$) family of algorithms for all $\lambda \in [0, 1)$ -- and linear autoregressive models. Our instance-dependent characterizations open the door to the design of fine-grained model selection procedures for hyperparameter tuning (e.g., choosing the value of $\lambda$ when running the TD($\lambda$) algorithm).
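As a rough illustration of the kind of procedure the abstract describes, the sketch below runs tabular TD(0), one instance of linear stochastic approximation, along a single Markov trajectory and compares the last iterate with an average of the iterates taken after a burn-in period. Everything specific here is an assumption made for illustration and is not taken from the paper: the three-state chain `P`, the rewards `r`, the discount factor, the constant step size, the burn-in length, and the hypothetical helper names `true_values` and `averaged_td0`.

```python
import random


def true_values(P, r, gamma, sweeps=200):
    """Solve the fixed point v = r + gamma * P v by repeated substitution.

    The map is a gamma-contraction, so a few hundred sweeps reach
    machine precision for gamma = 0.5.
    """
    d = len(r)
    v = [0.0] * d
    for _ in range(sweeps):
        v = [r[s] + gamma * sum(P[s][j] * v[j] for j in range(d)) for s in range(d)]
    return v


def averaged_td0(P, r, gamma, n, step, burn_in, seed=0):
    """Tabular TD(0) on one trajectory of length n.

    Returns (last_iterate, averaged_iterate), where the average is taken
    over the iterates produced at steps burn_in, ..., n - 1.
    """
    rng = random.Random(seed)
    d = len(r)
    states = list(range(d))
    v = [0.0] * d
    running_sum = [0.0] * d
    s = 0
    for t in range(n):
        # Sample the next state from row s of the transition matrix.
        s_next = rng.choices(states, weights=P[s])[0]
        # Stochastic approximation step using the observed transition only.
        td_error = r[s] + gamma * v[s_next] - v[s]
        v[s] += step * td_error
        if t >= burn_in:
            for j in range(d):
                running_sum[j] += v[j]
        s = s_next
    averaged = [x / (n - burn_in) for x in running_sum]
    return v, averaged


# Illustrative (assumed) problem instance: a fast-mixing three-state chain.
P = [[0.5, 0.3, 0.2], [0.2, 0.5, 0.3], [0.3, 0.2, 0.5]]
r = [1.0, 0.0, -1.0]
gamma = 0.5

v_star = true_values(P, r, gamma)
v_last, v_avg = averaged_td0(P, r, gamma, n=200_000, step=0.02, burn_in=100_000)

err_last = max(abs(a - b) for a, b in zip(v_last, v_star))
err_avg = max(abs(a - b) for a, b in zip(v_avg, v_star))
print("sup-norm error, last iterate:    ", err_last)
print("sup-norm error, averaged iterate:", err_avg)
```

The burn-in discards early iterates that still carry the effect of the initialization. The particular split used here (half the trajectory) is a convenience of this sketch, not a recommendation drawn from the results above.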