This work uses the entropy-regularised relaxed stochastic control perspective as a principled framework for designing reinforcement learning (RL) algorithms. Here, the agent interacts with the environment by generating noisy controls distributed according to the optimal relaxed policy. On the one hand, noisy policies explore the space and hence facilitate learning; on the other hand, they introduce bias by assigning positive probability to non-optimal actions. This exploration-exploitation trade-off is determined by the strength of entropy regularisation. We study algorithms resulting from two entropy regularisation formulations: the exploratory control approach, where entropy is added to the cost objective, and the proximal policy update approach, where entropy penalises policy divergence between consecutive episodes. We focus on the finite-horizon continuous-time linear-quadratic (LQ) RL problem, where linear dynamics with unknown drift coefficients are controlled subject to quadratic costs. In this setting, both algorithms yield a Gaussian relaxed policy. We quantify the precise difference between the value functions of a Gaussian policy and its noisy evaluation and show that the execution noise must be independent across time. By tuning the frequency of sampling from relaxed policies and the parameter governing the strength of entropy regularisation, we prove that the regret of both learning algorithms is of order $\mathcal{O}(\sqrt{N})$ (up to a logarithmic factor) over $N$ episodes, matching the best known result in the literature.
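To make the idea of executing a Gaussian relaxed policy concrete, the following minimal sketch simulates one episode of a scalar LQ system in which a noisy control is redrawn on a sampling grid, with independent noise at each draw, and held constant in between. It is an illustration only, not the algorithms analysed in this work: the function name `simulate_episode`, the coefficients `A`, `B`, the feedback gain `K`, the noise levels, and the unit cost weights are all hypothetical, and no learning of the unknown drift or regret computation is performed.

```python
import numpy as np


def simulate_episode(A, B, K, policy_std, T=1.0, n_steps=1000, sample_every=10,
                     x0=1.0, sigma=0.5, seed=0):
    """Euler-Maruyama rollout of dX = (A X + B a) dt + sigma dW (scalar case).

    A fresh control a ~ N(K * X, policy_std**2) is drawn every `sample_every`
    steps, using independent noise at each draw, and is held constant between
    draws. All parameter values are illustrative assumptions.
    Returns the state path, the control path, and the realised quadratic cost.
    """
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    xs = np.empty(n_steps + 1)
    actions = np.empty(n_steps)
    xs[0] = x0
    cost = 0.0
    a = 0.0
    for i in range(n_steps):
        if i % sample_every == 0:
            # Sample from the Gaussian policy; the noise is independent across draws.
            a = K * xs[i] + policy_std * rng.standard_normal()
        actions[i] = a
        # Running cost with unit weights on state and control (assumed).
        cost += (xs[i] ** 2 + a ** 2) * dt
        dw = np.sqrt(dt) * rng.standard_normal()
        xs[i + 1] = xs[i] + (A * xs[i] + B * a) * dt + sigma * dw
    cost += xs[-1] ** 2  # terminal cost with unit weight (assumed)
    return xs, actions, cost


if __name__ == "__main__":
    xs, actions, cost = simulate_episode(A=-0.5, B=1.0, K=-1.0, policy_std=0.3)
    print(f"terminal state: {xs[-1]:.4f}, episode cost: {cost:.4f}")
```

In this sketch, `sample_every` plays the role of the sampling frequency and `policy_std` stands in for the exploration level induced by the entropy regularisation strength; setting `policy_std=0` recovers a deterministic linear feedback control.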