Given a dataset of actions and the long-term rewards that follow them, a direct estimation approach fits value functions that minimize prediction error on the training data. Temporal difference (TD) learning methods instead fit value functions by minimizing the degree of temporal inconsistency between estimates made at successive time steps. Focusing on finite-state Markov chains, we provide a crisp asymptotic theory of the statistical advantages of TD over direct estimation. First, we show that an intuitive inverse trajectory pooling coefficient completely characterizes the percent reduction in mean-squared error of value estimates. Depending on problem structure, the reduction could be enormous or nonexistent. Next, we prove that there can be dramatic improvements in estimates of the difference in value-to-go for two states: TD's errors are bounded in terms of a novel measure, the problem's trajectory crossing time, which can be much smaller than the problem's time horizon.
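To make the contrast between the two estimators concrete, the following is a minimal sketch on a hand-made toy dataset; it is not taken from the paper. The dataset `EPISODES`, the function names `direct_estimate` and `batch_td_estimate`, and the `sweeps` and `step` parameters are all illustrative assumptions. The setting is undiscounted and episodic. The direct estimator averages the observed returns from each state, while the TD estimator repeatedly applies batch TD(0) updates until successive estimates are temporally consistent on the data.

```python
from collections import defaultdict

# Hypothetical toy data: each episode is a list of (state, reward) pairs,
# where the reward is received on leaving the state. State "A" is seen once
# and always leads to "B"; state "B" is seen eight times in total.
EPISODES = [[("A", 0.0), ("B", 0.0)]] + [[("B", 1.0)]] * 6 + [[("B", 0.0)]]


def direct_estimate(episodes):
    """Direct (Monte Carlo) estimate: average observed return-to-go per state."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for episode in episodes:
        ret = 0.0
        # Walk backwards so `ret` is the return from each visited state onward.
        for state, reward in reversed(episode):
            ret += reward
            totals[state] += ret
            counts[state] += 1
    return {s: totals[s] / counts[s] for s in totals}


def batch_td_estimate(episodes, sweeps=200, step=0.5):
    """Batch TD(0): sweep the data, shrinking the average TD error per state."""
    values = defaultdict(float)
    for _ in range(sweeps):
        td_sum = defaultdict(float)
        counts = defaultdict(int)
        for episode in episodes:
            for t, (state, reward) in enumerate(episode):
                # The value after the last step of an episode is zero.
                next_value = values[episode[t + 1][0]] if t + 1 < len(episode) else 0.0
                td_sum[state] += reward + next_value - values[state]
                counts[state] += 1
        # Apply all updates together after the sweep (synchronous batch update).
        for state in td_sum:
            values[state] += step * td_sum[state] / counts[state]
    return dict(values)


if __name__ == "__main__":
    print("direct:", direct_estimate(EPISODES))
    print("TD:    ", batch_td_estimate(EPISODES))
```

On this data the direct estimate for state `A` is 0, because the single trajectory through `A` happened to earn nothing, whereas the TD estimate for `A` converges to about 0.75, the same as for `B`. TD in effect pools the seven other trajectories that pass through `B` when valuing `A`, which is the kind of trajectory pooling the abstract refers to.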