We consider the problem of finding the optimal value of n in the n-step temporal difference (TD) learning algorithm. Our approach is a model-free optimization technique based on a one-simulation simultaneous perturbation stochastic approximation (SPSA) procedure, which we adapt to the discrete optimization setting using a random projection approach. We prove the convergence of the proposed algorithm, SDPSA, using a differential inclusions approach and show that it finds the optimal value of n in n-step TD. Experiments show that SDPSA attains the optimal value of n from arbitrary initial values.
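To make the idea of a one-simulation SPSA update with random projection concrete, the following is a minimal sketch and not the SDPSA algorithm as specified in the paper. It assumes a continuous surrogate parameter `theta` on the interval `[lo, hi]`, a symmetric ±1 perturbation, a step size of 1/m, and a hypothetical noisy objective `toy_cost` that stands in for the error of n-step TD. The function names, the step-size schedule, and the value of `delta` are all illustrative assumptions.

```python
import math
import random


def random_project(x, lo, hi, rng):
    """Randomly round x to a neighbouring integer in [lo, hi].

    The upper neighbour is chosen with probability equal to the fractional
    part of the clipped x, so the result equals the clipped x in expectation.
    """
    x = min(max(x, lo), hi)
    base = math.floor(x)
    frac = x - base
    return int(base + 1) if rng.random() < frac else int(base)


def sdpsa_sketch(noisy_cost, lo, hi, theta0, iters=5000, delta=1.0, seed=0):
    """One-simulation SPSA on a continuous surrogate of a discrete parameter.

    Each iteration uses a single noisy cost evaluation at a randomly
    projected, perturbed point. Step size 1/m and delta are assumptions.
    """
    rng = random.Random(seed)
    theta = float(theta0)
    for m in range(1, iters + 1):
        step = 1.0 / m                       # assumed diminishing step size
        perturb = rng.choice((-1.0, 1.0))    # symmetric Bernoulli perturbation
        n = random_project(theta + delta * perturb, lo, hi, rng)
        # One-measurement gradient estimate: cost / (delta * perturbation).
        estimate = noisy_cost(n, rng) / (delta * perturb)
        # Descent step, then clip back to the feasible interval.
        theta = min(max(theta - step * estimate, lo), hi)
    return theta


def toy_cost(n, rng):
    """Hypothetical noisy objective with its minimum at n = 4."""
    return (n - 4) ** 2 / 10.0 + rng.gauss(0.0, 0.1)


if __name__ == "__main__":
    theta = sdpsa_sketch(toy_cost, lo=1, hi=8, theta0=7.0)
    print("surrogate parameter:", theta, "rounded n:", round(theta))
```

In an actual n-step TD setting, `toy_cost` would be replaced by a single simulation that runs n-step TD with the projected value of n and returns a noisy performance measure. One-measurement estimates have high variance, so this sketch makes no claim about how quickly, or how reliably, the iterate settles near the minimizer.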