In safety-critical RL settings, the inclusion of an additional cost function is often favoured over the arduous task of modifying the reward function to ensure the agent's safe behaviour. However, designing or evaluating such a cost function can be prohibitively expensive. For instance, in the domain of self-driving, designing a cost function that encompasses all unsafe behaviours (e.g. aggressive lane changes) is inherently complex. In such scenarios, the cost function can be learned from feedback collected offline between training rounds. This feedback can be system-generated or elicited from a human observing the training process. Previous approaches have not been able to scale to complex environments and are constrained to receiving feedback at the state level, which can be expensive to collect. To this end, we introduce an approach that scales to more complex domains and extends beyond state-level feedback, thus reducing the burden on the evaluator. Inferring the cost function in such settings poses challenges, particularly in assigning credit to individual states based on trajectory-level feedback. To address this, we propose a surrogate objective that transforms the problem into a state-level supervised classification task with noisy labels, which can be solved efficiently. Additionally, it is often infeasible to collect feedback on every trajectory generated by the agent; hence, two fundamental questions arise: (1) Which trajectories should be presented to the human? and (2) How many trajectories are necessary for effective learning? To address these questions, we introduce \textit{novelty-based sampling} that selectively involves the evaluator only when the agent encounters a \textit{novel} trajectory. We showcase the efficiency of our method through experimentation on several benchmark Safety Gymnasium environments and realistic self-driving scenarios.