Modern cyber-physical systems are becoming increasingly complex to model, motivating data-driven techniques such as reinforcement learning (RL) to find appropriate control agents. However, most systems are subject to hard constraints such as safety or operational bounds. Typically, to learn to satisfy these constraints, the agent must systematically violate them, which is computationally prohibitive in most systems. Recent efforts utilize feasibility models that assess whether a proposed action is feasible, preventing the agent's infeasible action proposals from being applied to the system. However, these efforts focus on guaranteeing constraint satisfaction rather than on the agent's learning efficiency. To improve the learning process, we introduce action mapping, a novel approach that divides the learning process into two steps: the agent first learns feasibility and subsequently learns the objective by mapping actions into the set of feasible actions. This paper focuses on the feasibility step: learning to generate all feasible actions through self-supervised querying of the feasibility model. We train the agent by formulating the problem as a distribution-matching problem and deriving gradient estimators for different divergences. Through an illustrative example, a robotic path planning scenario, and a robotic grasping simulation, we demonstrate the agent's proficiency in generating actions across disconnected feasible action sets. By addressing the feasibility step, this paper makes it possible to focus future work on the objective part of action mapping, paving the way for an RL framework that is both safe and efficient.
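The core idea above can be illustrated with a minimal, self-contained sketch (not the paper's method): propose random actions, query a toy feasibility model for labels in a self-supervised fashion, and then fit a sampler whose output distribution approximates the uniform distribution over the feasible set. All names here are hypothetical, the feasibility model is a stand-in with two disconnected feasible intervals, and the histogram-based sampler substitutes for the paper's learned generative model trained via divergence-gradient estimators.

```python
import numpy as np

def feasibility_model(action):
    """Toy feasibility model (an assumption for illustration): the feasible
    action set consists of two disconnected intervals on [0, 1]."""
    return (0.1 <= action <= 0.3) or (0.7 <= action <= 0.9)

def fit_feasible_sampler(n_queries=5000, n_bins=20, seed=0):
    """Self-supervised step: propose random actions, query the feasibility
    model for labels, and fit a histogram over the feasible proposals.
    The histogram approximates the uniform distribution over the feasible
    set, standing in for a learned generative model."""
    rng = np.random.default_rng(seed)
    proposals = rng.uniform(0.0, 1.0, size=n_queries)
    feasible = np.array([feasibility_model(a) for a in proposals])
    counts, edges = np.histogram(proposals[feasible], bins=n_bins, range=(0.0, 1.0))
    probs = counts / counts.sum()  # bin-selection probabilities

    def sampler(n, rng=rng):
        # pick a bin in proportion to its feasible mass, then sample uniformly in it
        bins = rng.choice(n_bins, size=n, p=probs)
        return rng.uniform(edges[bins], edges[bins + 1])

    return sampler

sampler = fit_feasible_sampler()
samples = sampler(1000)
coverage = np.mean([feasibility_model(a) for a in samples])
```

Because the sampler only places mass where queried actions were labeled feasible, nearly all generated actions are feasible, and both disconnected intervals are covered, which is the property the abstract highlights for action mapping's feasibility step.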