The Whittle index policy is a heuristic for the intractable restless multi-armed bandit (RMAB) problem. Although the policy is provably asymptotically optimal, finding Whittle indices remains difficult. In this paper, we present Neural-Q-Whittle, a Whittle index based Q-learning algorithm for RMAB with neural network function approximation. It is an instance of nonlinear two-timescale stochastic approximation, with Q-function values updated on a faster timescale and Whittle indices on a slower timescale. Despite the empirical success of deep Q-learning, the non-asymptotic convergence rate of Neural-Q-Whittle, which couples neural networks with two-timescale Q-learning, remains largely unclear. This paper provides a finite-time analysis of Neural-Q-Whittle in which the data are generated from a Markov chain and the Q-function is approximated by a ReLU neural network. Our analysis leverages a Lyapunov drift approach to capture the evolution of the two coupled parameters, and the nonlinearity in value function approximation further requires us to characterize the approximation error. Combining these results yields an $\mathcal{O}(1/k^{2/3})$ convergence rate for Neural-Q-Whittle, where $k$ is the number of iterations.
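The two-timescale structure described in the abstract can be illustrated with a minimal sketch. This is not the Neural-Q-Whittle algorithm itself: a Q-table stands in for the ReLU network, and the discounted formulation, the passive subsidy `lam`, the step-size schedules, the uniform exploration rule, and the toy two-state arm (`P`, `R`) are all assumptions made for illustration. The only point it is meant to show is that the Q-values use a step size that decays more slowly than the step size of the index estimate, so the Q-values move on the faster timescale.

```python
import numpy as np


def q_whittle_two_timescale(P, R, ref_state, gamma=0.9, n_steps=5000, seed=0):
    """Tabular two-timescale sketch for a single arm (illustrative only).

    P[a] is an (S, S) transition matrix for action a (0 = passive, 1 = active).
    R is an (S, 2) reward table. ref_state is the state whose index is estimated.
    """
    rng = np.random.default_rng(seed)
    n_states = R.shape[0]
    Q = np.zeros((n_states, 2))  # Q-table standing in for the neural network
    lam = 0.0                    # running estimate of the index at ref_state
    s = 0
    for k in range(1, n_steps + 1):
        alpha = 1.0 / k ** (2.0 / 3.0)  # faster timescale (assumed schedule)
        eta = 1.0 / k                   # slower timescale: eta / alpha -> 0
        a = int(rng.integers(2))        # uniform exploration (assumption)
        s_next = int(rng.choice(n_states, p=P[a][s]))
        # The passive action (a = 0) earns the subsidy lam on top of the reward.
        target = R[s, a] + (1 - a) * lam + gamma * Q[s_next].max()
        Q[s, a] += alpha * (target - Q[s, a])
        # Move lam toward the value that makes both actions equally
        # attractive in the reference state.
        lam += eta * (Q[ref_state, 1] - Q[ref_state, 0])
        s = s_next
    return Q, lam


# Hypothetical two-state arm, used only to exercise the function.
P = np.array([
    [[0.9, 0.1], [0.5, 0.5]],  # passive transitions
    [[0.2, 0.8], [0.1, 0.9]],  # active transitions
])
R = np.array([[0.0, 0.0], [0.0, 1.0]])

Q, lam = q_whittle_two_timescale(P, R, ref_state=1)
```

Replacing the table with a ReLU network trained by semi-gradient updates, and drawing samples from a single Markovian trajectory as above, gives the coupled nonlinear recursion whose finite-time behavior the paper analyzes.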