This paper studies q-learning, recently introduced by Jia and Zhou (2022c) as the continuous-time counterpart of Q-learning, for continuous-time McKean-Vlasov control problems in the setting of entropy-regularized reinforcement learning. In contrast to the single-agent control problem in Jia and Zhou (2022c), the mean-field interaction of agents renders the definition of the q-function more subtle, and we reveal that two distinct q-functions naturally arise: (i) the integrated q-function (denoted by $q$), which is the first-order approximation of the integrated Q-function introduced in Gu, Guo, Wei and Xu (2023) and can be learnt through a weak martingale condition involving test policies; and (ii) the essential q-function (denoted by $q_e$), which is employed in the policy improvement iterations. We show that the two q-functions are related via an integral representation under all test policies. Based on the weak martingale condition for the integrated q-function and our proposed method for searching test policies, we devise model-free offline and online learning algorithms. In two financial applications, one within the linear-quadratic (LQ) control framework and one beyond it, we obtain exact parameterizations of the value function and the two q-functions and illustrate our algorithms with simulation experiments.