Reinforcement learning (RL) with continuous state and action spaces remains one of the most challenging problems in the field. Most current learning methods rely on integral identities such as value functions to derive an optimal strategy for the learning agent. In this paper, we instead study the dual form of the original RL formulation and propose the first differential RL framework that can handle settings with limited training samples and short episodes. Our approach introduces Differential Policy Optimization (DPO), a pointwise, stage-wise iteration method that optimizes policies encoded by local-movement operators. We prove a pointwise convergence estimate for DPO and provide a regret bound comparable to the best current theoretical results. This pointwise estimate ensures that the learned policy matches the optimal path uniformly across different steps. We then apply DPO to a class of practical RL problems that have continuous state and action spaces and that search for optimal configurations with Lagrangian rewards. DPO is easy to implement and scalable, and it shows competitive results in benchmark experiments against several popular RL methods.