Planning for a wide range of real-world tasks requires knowing and specifying all constraints. However, in many instances these constraints are either unknown or difficult to specify accurately. A possible solution is to infer the unknown constraints from expert demonstrations. Most prior work is limited to learning simple linear constraints, or requires strong knowledge of the true constraint parameterization or of the environment model. To mitigate these problems, this paper presents a positive-unlabeled (PU) learning approach to infer a continuous, arbitrary, and possibly nonlinear constraint from demonstrations. From a PU learning perspective, we treat all data in the demonstrations as positive (feasible) data, and learn a (sub-)optimal policy to generate high-reward but potentially infeasible trajectories, which serve as unlabeled data containing both feasible and infeasible states. Under an assumption on the data distribution, a feasible-infeasible classifier (i.e., the constraint model) is learned from the two datasets through a postprocessing PU learning technique. The overall method employs an iterative framework that alternates between the policy update, which generates and selects higher-reward policies, and the constraint model update. Additionally, a memory buffer is introduced to record and reuse samples from previous iterations to prevent forgetting. The effectiveness of the proposed method is validated in two MuJoCo environments, where it successfully infers continuous nonlinear constraints and outperforms a baseline method in terms of constraint accuracy and policy safety.
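To make the postprocessing PU step concrete, below is a minimal sketch of one standard postprocessing correction (Elkan and Noto, 2008), which rescales a positive-vs-unlabeled classifier into a feasibility estimate. Whether the paper uses this exact correction is an assumption, and the names `demo_states` and `rollout_states` are hypothetical placeholders for the demonstration and policy-rollout datasets.

```python
# Illustrative PU-learning sketch, not the paper's exact method.
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_pu_constraint_classifier(demo_states, rollout_states):
    """Fit s(x) = P(labeled | x) on positive-vs-unlabeled data, then
    rescale it into an estimate of P(feasible | x).

    demo_states:    states from expert demonstrations (treated as positive).
    rollout_states: states from high-reward policy rollouts (unlabeled,
                    containing both feasible and infeasible states).
    """
    X = np.vstack([demo_states, rollout_states])
    s = np.concatenate([np.ones(len(demo_states)),      # labeled positives
                        np.zeros(len(rollout_states))])  # unlabeled
    clf = LogisticRegression(max_iter=1000).fit(X, s)

    # Postprocessing step: under the "selected completely at random"
    # assumption, P(y=1 | x) = P(s=1 | x) / c, where the constant
    # c = P(s=1 | y=1) is estimated as the mean classifier score on
    # the (ideally held-out) positive set.
    c = clf.predict_proba(demo_states)[:, 1].mean()

    def feasibility(x):
        # Estimated probability that state x satisfies the constraint.
        return np.clip(clf.predict_proba(np.atleast_2d(x))[:, 1] / c, 0.0, 1.0)

    return feasibility
```

In an iterative scheme such as the one described above, this classifier would be refit each iteration on the demonstrations plus the accumulated rollout buffer, and its output used as the constraint model when updating the policy.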