We consider the problem of continuous-time policy evaluation: learning, from observations, the value function associated with uncontrolled continuous-time stochastic dynamics and a reward function. We propose two original variants of the well-known TD(0) method that use vanishing time steps. One is model-free and the other is model-based. For both methods, we prove theoretical convergence rates, which we then verify through numerical simulations. These methods can also be interpreted as novel reinforcement learning approaches for approximating solutions of linear PDEs (partial differential equations) or linear BSDEs (backward stochastic differential equations).
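To make the setting concrete, the following is a minimal sketch of a textbook model-free TD(0) update with a small fixed time step; it is not either of the two variants proposed here, and it does not use their vanishing time-step schedule. Everything specific in it is an assumption chosen for illustration: the Ornstein-Uhlenbeck dynamics dX = -X dt + sigma dW, the reward r(x) = x^2, the discount rate rho, the linear parametrization V(x) ≈ a + b x^2, and the function names `td0_ou` and `exact_coefficients`. For this toy problem the value function solves the linear PDE rho V = x^2 - x V' + (sigma^2 / 2) V'', whose solution is exactly quadratic, so the learned coefficients can be compared with a closed form.

```python
import math

import numpy as np


def td0_ou(n_steps=400_000, dt=0.01, rho=1.0, sigma=1.0, lr=0.002, seed=0):
    """Textbook TD(0) with features (1, x^2) on a simulated OU trajectory.

    Illustrative assumptions: OU dynamics, reward x^2, discount rate rho,
    fixed time step dt and fixed learning rate lr.
    """
    rng = np.random.default_rng(seed)
    gamma = math.exp(-rho * dt)  # discount factor over one time step
    sqrt_dt = math.sqrt(dt)
    a, b = 0.0, 0.0              # V(x) is approximated by a + b * x^2
    x = 0.0
    for z in rng.standard_normal(n_steps).tolist():
        # Observe one transition (Euler-Maruyama step of the OU process).
        x_next = x - x * dt + sigma * sqrt_dt * z
        # TD error: reward accrued over dt plus discounted next value,
        # minus the current value estimate.
        delta = x * x * dt + gamma * (a + b * x_next * x_next) - (a + b * x * x)
        # Semi-gradient update along the features (1, x^2).
        a += lr * delta
        b += lr * delta * x * x
        x = x_next
    return a, b


def exact_coefficients(rho=1.0, sigma=1.0):
    """Coefficients of V(x) = a + b x^2 solving the linear PDE above."""
    b = 1.0 / (2.0 + rho)
    a = sigma**2 * b / rho
    return a, b


if __name__ == "__main__":
    print("learned:", td0_ou())
    print("exact:  ", exact_coefficients())
```

The update uses only observed transitions and rewards, so it is model-free in the usual sense. With a fixed learning rate the estimates keep fluctuating around the exact coefficients rather than converging to them, and the fixed time step leaves a discretization bias, which is the kind of error that an analysis with vanishing time steps is meant to control.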