Model-free reinforcement learning methods lack an inherent mechanism for imposing behavioural constraints on the trained policies. Although certain extensions exist, they remain limited to specific types of constraints, such as value constraints with additional reward signals or visitation density constraints. In this work we aim to unify these existing techniques and bridge the gap with classical optimization and control theory, using a generic primal-dual framework for value-based and actor-critic reinforcement learning methods. The resulting dual formulations prove especially useful for imposing additional constraints on the learned policy, because they reveal an intrinsic relationship between such dual constraints (or regularization terms) and reward modifications in the primal. Furthermore, this framework lets us introduce novel types of constraints that bound the policy's action density or the costs associated with transitions between consecutive states and actions. From the adjusted primal-dual optimization problems, we derive a practical algorithm that supports various combinations of policy constraints, which are handled automatically throughout training using trainable reward modifications. The resulting $\texttt{DualCRL}$ method is examined in more detail and evaluated under different (combinations of) constraints on two interpretable environments. The results highlight the efficacy of the method, which ultimately provides the designer of such systems with a versatile toolbox of possible policy constraints.
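To make the link between a dual constraint and a reward modification concrete, the following is a minimal sketch of textbook Lagrangian relaxation on a toy two-action bandit. It is not the $\texttt{DualCRL}$ algorithm itself: the function names, the toy rewards and costs, the entropy temperature and the step size are all assumptions chosen for illustration. A single multiplier `lam` is trained by dual ascent and acts as a trainable penalty that turns the reward `r` into `r - lam * c`.

```python
import math


def soft_policy(rewards, temperature):
    """Entropy-regularised best response: softmax of rewards / temperature."""
    scaled = [r / temperature for r in rewards]
    peak = max(scaled)  # subtract the maximum for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]


def constrained_policy(rewards, costs, lam, temperature):
    """Policy and expected cost under the modified reward r - lam * c."""
    modified = [r - lam * c for r, c in zip(rewards, costs)]
    policy = soft_policy(modified, temperature)
    expected_cost = sum(p * c for p, c in zip(policy, costs))
    return policy, expected_cost


def solve_constrained_bandit(rewards, costs, budget,
                             temperature=0.1, lr=0.1, steps=500):
    """Dual ascent on the multiplier of the constraint E[cost] <= budget.

    All names and default values here are hypothetical, for illustration only.
    """
    lam = 0.0
    for _ in range(steps):
        _, expected_cost = constrained_policy(rewards, costs, lam, temperature)
        # Raise lam while the constraint is violated, lower it otherwise,
        # and keep it non-negative.
        lam = max(0.0, lam + lr * (expected_cost - budget))
    policy, expected_cost = constrained_policy(rewards, costs, lam, temperature)
    return policy, lam, expected_cost


if __name__ == "__main__":
    # Action 0 pays more but incurs a cost; action 1 is cost-free.
    policy, lam, cost = solve_constrained_bandit(
        rewards=[1.0, 0.5], costs=[1.0, 0.0], budget=0.3)
    print("policy:", policy)
    print("multiplier:", lam)
    print("expected cost:", cost)
```

In this toy setting the multiplier settles at the value where the expected cost meets the budget, so the constraint is enforced purely through the learned reward penalty. The entropy term is included only so that the inner maximisation has a closed-form softmax solution.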