This paper introduces the Lagrange Policy for Continuous Actions (LPCA), a reinforcement learning algorithm designed for weakly coupled Markov decision process (MDP) problems with continuous action spaces. LPCA addresses resource constraints that depend on continuous actions by introducing a Lagrange relaxation of the weakly coupled MDP problem within a neural network framework for Q-value computation. This relaxation decouples the MDP, enabling efficient policy learning in resource-constrained environments. We present two variants of LPCA: LPCA-DE, which uses differential evolution for global optimization, and LPCA-Greedy, which selects actions incrementally and greedily based on Q-value gradients. Comparative analysis against other state-of-the-art techniques across various settings highlights LPCA's robustness and efficiency in managing resource allocation while maximizing rewards.
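To make the idea of incremental, gradient-driven action selection concrete, the following is a minimal sketch of a greedy allocation loop under a shared resource budget. It is not the paper's LPCA-Greedy implementation. The function name `greedy_allocate`, the parameters `step` and `a_max`, the assumption that one unit of action consumes one unit of resource, and the closed-form toy Q-functions (standing in for learned Q-networks) are all hypothetical choices made for illustration.

```python
def greedy_allocate(q_grads, budget, step=0.1, a_max=1.0):
    """Hand out `budget` in increments of `step`, each time to the
    sub-problem whose Q-value gradient at its current action is largest.

    q_grads: list of callables, q_grads[i](a) ~ dQ_i/da at action a
             (hypothetical stand-ins for gradients of learned Q-networks).
    Assumes one unit of action costs one unit of the shared resource.
    """
    n_steps = int(round(budget / step))
    actions = [0.0] * len(q_grads)
    for _ in range(n_steps):
        best, best_grad = None, 0.0
        for i, grad in enumerate(q_grads):
            # Skip sub-problems already at their action upper bound.
            if actions[i] + step > a_max + 1e-12:
                continue
            g = grad(actions[i])
            # Only consider increments that are expected to raise Q.
            if g > best_grad:
                best, best_grad = i, g
        if best is None:
            break  # no remaining increment improves any Q-value
        actions[best] += step
    return actions


# Toy example: Q_i(a) = w_i * log(1 + a), so dQ_i/da = w_i / (1 + a).
weights = [3.0, 1.0, 0.5]
q_grads = [lambda a, w=w: w / (1.0 + a) for w in weights]
actions = greedy_allocate(q_grads, budget=1.5, step=0.1, a_max=1.0)
print(actions)
```

In this toy setting the sub-problem with the largest weight receives resource first until it reaches its bound, and the remainder goes to the next most valuable sub-problem. A global optimizer such as differential evolution could be swapped in for the inner loop when the Q-functions are not concave and greedy increments risk getting stuck in local optima.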