The field of quickest change detection (QCD) concerns the design and analysis of algorithms that estimate, in real time, the time at which an important event takes place, and identify properties of the post-change behavior. It is shown in this paper that approaches based on reinforcement learning (RL) can be built on any "surrogate information state" process that is adapted to the observations. Hence we are left to choose both the surrogate information state process and the algorithm. For the former, it is argued that many choices are available, drawing on a rich theory of asymptotic statistics for QCD. Two approaches to RL design are considered: (i) Stochastic gradient descent based on an actor-critic formulation. Theory is largely complete for this approach: the algorithm is unbiased and converges to a local minimum. However, it is shown that the variance of the stochastic gradients can be very large, necessitating commensurately long run times; (ii) Q-learning algorithms based on a version of the projected Bellman equation. It is shown that the algorithm is stable, in the sense of bounded sample paths, and that a solution to the projected Bellman equation exists under mild conditions. Numerical experiments illustrate these findings and provide a roadmap for algorithm design in more general settings.
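As a concrete illustration of one surrogate information state adapted to the observations, the sketch below implements the classical CUSUM recursion, a standard statistic from the asymptotic theory of QCD. The Gaussian mean-shift model, its log-likelihood ratio, and the threshold are illustrative assumptions for this sketch, not constructions taken from the paper.

```python
import numpy as np

def cusum_information_state(observations, log_likelihood_ratio):
    """Recursively compute the CUSUM statistic, one classical choice of
    surrogate information state adapted to the observation sequence."""
    z = 0.0
    path = []
    for y in observations:
        # CUSUM recursion: reflect at zero, accumulate evidence of change.
        z = max(0.0, z + log_likelihood_ratio(y))
        path.append(z)
    return np.array(path)

# Hypothetical example: Gaussian mean shift from 0 to 1, unit variance.
# The log-likelihood ratio for this pair of densities is l(y) = y - 1/2.
rng = np.random.default_rng(0)
pre = rng.normal(0.0, 1.0, size=100)   # pre-change samples
post = rng.normal(1.0, 1.0, size=100)  # post-change samples
ys = np.concatenate([pre, post])
stat = cusum_information_state(ys, lambda y: y - 0.5)

# Declare a change when the statistic first crosses a threshold b.
b = 5.0
alarm = int(np.argmax(stat >= b)) if np.any(stat >= b) else None
print("alarm time:", alarm)
```

The same recursion can feed any downstream RL algorithm: the scalar statistic replaces the full belief state as the input on which a policy or Q-function is defined.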
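For the second algorithm family, the following is a minimal sketch of semi-gradient Q-learning with linear function approximation, whose fixed point is a solution of a projected Bellman equation. The feature map, the false-alarm and delay costs, the discount factor, and the geometric change-point simulator are all illustrative assumptions, not the specific construction analyzed in the paper.

```python
import numpy as np

def features(z, a, n=8):
    """Hypothetical feature map: coarse binning of the surrogate
    information state z, one copy per action a in {0: continue, 1: stop}."""
    phi = np.zeros(2 * n)
    i = min(int(z), n - 1)  # bin the CUSUM-like statistic
    phi[a * n + i] = 1.0
    return phi

def q_learning(episodes=2000, alpha=0.05, gamma=0.98, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros(16)  # weights for the linear Q-approximation
    for _ in range(episodes):
        change = rng.geometric(0.05)  # random change point (assumed model)
        z, t = 0.0, 0
        while t < 400:
            post = t >= change
            y = rng.normal(1.0 if post else 0.0, 1.0)
            z_next = max(0.0, z + y - 0.5)  # CUSUM recursion as the state
            # Epsilon-greedy action: 1 = raise alarm, 0 = continue.
            if rng.random() < 0.1:
                a = int(rng.integers(2))
            else:
                a = int(theta @ features(z, 1) < theta @ features(z, 0))
            if a == 1:
                cost = 0.0 if post else 10.0   # false-alarm penalty
                target = cost                  # stopping is terminal
            else:
                cost = 1.0 if post else 0.0    # per-step delay cost
                q_next = min(theta @ features(z_next, 0),
                             theta @ features(z_next, 1))
                target = cost + gamma * q_next
            phi = features(z, a)
            theta += alpha * (target - theta @ phi) * phi
            if a == 1:
                break
            z, t = z_next, t + 1
    return theta
```

With tabular features such as the indicator binning above, the update is the standard Q-learning recursion restricted to the feature subspace, which is one simple setting in which existence of a projected Bellman fixed point can be verified directly.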