Given a dataset on actions and resulting long-term rewards, a direct estimation approach fits value functions that minimize prediction error on the training data. Temporal difference (TD) learning methods instead fit value functions by minimizing the degree of temporal inconsistency between estimates made at successive time-steps. Focusing on finite-state Markov chains, we provide a crisp asymptotic theory of the statistical advantages of TD over direct estimation. First, we show that an intuitive inverse trajectory pooling coefficient completely characterizes the percentage reduction in mean-squared error of value estimates. Depending on problem structure, the reduction could be enormous or nonexistent. Next, we prove that there can be dramatic improvements in estimates of the difference in value-to-go for two states: TD's errors are bounded in terms of a novel measure, the problem's trajectory crossing time, which can be much smaller than the problem's time horizon.
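To make the contrast between the two estimators concrete, the following is a minimal Python sketch on a two-state episodic chain. It is an illustration under stated assumptions, not the estimators or notation used in the paper: rewards are undiscounted, the direct estimate is taken to be every-visit Monte Carlo averaging, and the TD estimate is taken to be the fixed point of batch TD(0), which in the tabular case is commonly characterized as the solution of the Bellman equations on the empirical transition model. The function names `direct_estimate` and `batch_td_estimate` and the toy dataset are hypothetical, and the sketch does not compute the inverse trajectory pooling coefficient or the trajectory crossing time.

```python
import numpy as np


def direct_estimate(episodes, n_states):
    """Every-visit Monte Carlo: average the undiscounted return seen after each visit."""
    totals = np.zeros(n_states)
    counts = np.zeros(n_states)
    for episode in episodes:
        ret = 0.0
        # Walk backwards so `ret` is the return-to-go from each visited state.
        for state, reward in reversed(episode):
            ret += reward
            totals[state] += ret
            counts[state] += 1
    # Unvisited states keep an estimate of 0 (assumption of this sketch).
    return totals / np.maximum(counts, 1)


def batch_td_estimate(episodes, n_states):
    """Assumed batch TD(0) fixed point: solve V = r_hat + P_hat V on the empirical model."""
    trans = np.zeros((n_states, n_states))
    reward_sum = np.zeros(n_states)
    counts = np.zeros(n_states)
    for episode in episodes:
        for t, (state, reward) in enumerate(episode):
            counts[state] += 1
            reward_sum[state] += reward
            # The last step of an episode transitions to termination (no successor).
            if t + 1 < len(episode):
                trans[state, episode[t + 1][0]] += 1
    visits = np.maximum(counts, 1)
    p_hat = trans / visits[:, None]   # empirical (sub-stochastic) transition matrix
    r_hat = reward_sum / visits       # empirical mean one-step reward
    return np.linalg.solve(np.eye(n_states) - p_hat, r_hat)


# Toy dataset (hypothetical): each episode is a list of (state, reward) pairs.
# State 0 is visited once and leads to state 1; state 1 is visited eight times.
episodes = [[(0, 0.0), (1, 0.0)]] + [[(1, 1.0)]] * 6 + [[(1, 0.0)]]

mc = direct_estimate(episodes, n_states=2)
td = batch_td_estimate(episodes, n_states=2)
print("direct:", mc)  # direct: [0.   0.75]
print("TD:    ", td)  # TD:     [0.75 0.75]
```

In this toy dataset the direct estimate for state 0 rests on the single trajectory that started there, while the TD estimate reuses all eight trajectories that pass through state 1. This is the kind of pooling across trajectories that the abstract credits for TD's statistical advantage; how large the gain is depends on the structure of the chain.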