In traditional statistical learning, data points are usually assumed to be independent and identically distributed (i.i.d.) according to an unknown probability distribution. This paper takes a contrasting view: it treats data points as interconnected and models the data with a Markov reward process (MRP). We reformulate standard supervised learning as an on-policy policy evaluation problem in reinforcement learning (RL) and introduce a generalized temporal difference (TD) learning algorithm to solve it. Theoretically, our analysis connects the solutions of linear TD learning and ordinary least squares (OLS). We also show that under specific conditions, particularly when the noise is correlated, the TD solution is a more effective estimator than the OLS solution. Furthermore, we establish the convergence of our generalized TD algorithms under linear function approximation. Empirical studies verify our theoretical results, examine the key design choices of our TD algorithm, and demonstrate its practical utility across various datasets, covering tasks such as regression and image classification with deep learning.
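To make the link between linear TD and OLS concrete, the following is a minimal sketch of one way a regression dataset could be cast as an MRP. It is not taken from the paper. The reward definition r = y - gamma * P y, the uniform transition matrix `P`, the synthetic data, and the helper names `td_fixed_point` and `expected_td` are all assumptions made for illustration. Under these assumptions the linear TD fixed point solves X^T (I - gamma P) X w = X^T (I - gamma P) y, which reduces to the OLS normal equations when gamma = 0.

```python
import numpy as np

def td_fixed_point(X, y, P, gamma):
    """Closed-form linear TD fixed point for the assumed reward r = y - gamma * P y."""
    M = np.eye(len(y)) - gamma * P
    A = X.T @ M @ X
    b = X.T @ M @ y
    return np.linalg.solve(A, b)

def expected_td(X, y, P, gamma, alpha=0.1, steps=2000):
    """Expected (batch) linear TD(0) updates under a uniform state distribution."""
    n, d = X.shape
    r = y - gamma * (P @ y)              # assumed reward construction
    w = np.zeros(d)
    for _ in range(steps):
        v = X @ w                        # current value estimates
        delta = r + gamma * (P @ v) - v  # expected TD error for each data point
        w += alpha * (X.T @ delta) / n
    return w

# Synthetic, noise-free regression data (illustrative only).
rng = np.random.default_rng(0)
n, d = 200, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true
P = np.full((n, n), 1.0 / n)             # hypothetical uniform transitions

w_ols = np.linalg.lstsq(X, y, rcond=None)[0]
w_td = td_fixed_point(X, y, P, gamma=0.5)
w_iter = expected_td(X, y, P, gamma=0.5)
```

In this noise-free setting all three estimates should agree with `w_true`. The case the abstract highlights, where TD can outperform OLS under correlated noise, depends on the choice of transition matrix and is not reproduced by this sketch.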