The theory and application of stochastic approximation (SA) have grown within the control systems community since the earliest days of adaptive control. This paper takes a new look at the topic, motivated by recent results establishing remarkable performance of SA with (sufficiently small) constant step-size $\alpha>0$. If averaging is implemented to obtain the final parameter estimate, then the estimates are asymptotically unbiased with nearly optimal asymptotic covariance. These results have been obtained for random linear SA recursions with i.i.d.\ coefficients. This paper obtains very different conclusions in the more common case of a geometrically ergodic Markovian disturbance: (i) the \textit{target bias} is identified, even in the case of non-linear SA, and is in general non-zero. The remaining results are established for linear SA recursions: (ii) the bivariate parameter-disturbance process is geometrically ergodic in a topological sense; (iii) the representation for the bias has a simpler form in this case, and the bias cannot be expected to be zero if there is multiplicative noise; (iv) the asymptotic covariance of the averaged parameters is within $O(\alpha)$ of optimal. The error term is identified, and may be massive if the mean dynamics are not well conditioned. The theory is illustrated with an application to TD-learning.
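For orientation, the setting can be sketched as follows; the notation here ($f$, $\Phi$, $A$, $b$, $N_0$) is assumed for illustration and is not fixed by the text above. A constant step-size SA recursion driven by a Markov chain $\{\Phi_n\}$, together with the averaged estimate, might be written
\[
\theta_{n+1} = \theta_n + \alpha f(\theta_n, \Phi_{n+1}), \qquad
\theta^{\mathrm{PR}}_N = \frac{1}{N - N_0} \sum_{n=N_0+1}^{N} \theta_n ,
\]
where $N_0$ is a burn-in length. In the linear case one may take $f(\theta, \Phi) = A(\Phi)\theta - b(\Phi)$, and multiplicative noise then refers to the dependence of $A$ on $\Phi$.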