Safety through feedback in Constrained RL

In safety-critical RL settings, adding a separate cost function is often favoured over the arduous task of modifying the reward function to ensure the agent behaves safely. However, designing or evaluating such a cost function can be prohibitively expensive. For instance, in self-driving, designing a cost function that covers all unsafe behaviours (e.g. aggressive lane changes) is inherently complex. In such scenarios, the cost function can instead be learned from feedback collected offline between training rounds. This feedback can be system-generated or elicited from a human observing the training process. Previous approaches have not scaled to complex environments and are restricted to state-level feedback, which can be expensive to collect. To this end, we introduce an approach that scales to more complex domains and extends beyond state-level feedback, thus reducing the burden on the evaluator. Inferring the cost function in such settings poses challenges, particularly in assigning credit to individual states based on trajectory-level feedback. To address this, we propose a surrogate objective that transforms the problem into a state-level supervised classification task with noisy labels, which can be solved efficiently. Additionally, it is often infeasible to collect feedback on every trajectory generated by the agent, so two fundamental questions arise: (1) Which trajectories should be presented to the human? and (2) How many trajectories are necessary for effective learning? To address these questions, we introduce \textit{novelty-based sampling}, which involves the evaluator only when the agent encounters a \textit{novel} trajectory. We demonstrate the efficiency of our method through experiments on several benchmark Safety Gymnasium environments and realistic self-driving scenarios.
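To make the two ideas above concrete, the following is a minimal illustrative sketch, not the paper's implementation; the abstract does not specify the surrogate objective or the novelty criterion. It assumes (i) the simplest possible reduction, in which every state inherits the binary feedback given to its trajectory, so safe states inside an unsafe trajectory receive noisy labels, (ii) a plain logistic-regression classifier as the learned cost model, and (iii) a hypothetical grid-based novelty test. All names (`propagate_labels`, `fit_cost_classifier`, `NoveltyFilter`, `cell_size`, `threshold`) are invented for illustration.

```python
import numpy as np


def propagate_labels(trajectories, feedback):
    """Assumed reduction: each state inherits its trajectory's label.

    trajectories: list of arrays of shape (T_i, d)
    feedback:     list of 0 (safe) / 1 (unsafe), one per trajectory
    """
    states = np.concatenate(trajectories, axis=0)
    labels = np.concatenate(
        [np.full(len(traj), float(fb)) for traj, fb in zip(trajectories, feedback)]
    )
    return states, labels


def _with_bias(states):
    # Append a constant column so the classifier learns an intercept.
    return np.hstack([states, np.ones((len(states), 1))])


def fit_cost_classifier(states, labels, lr=0.5, steps=2000):
    """Logistic regression trained by gradient descent on cross-entropy."""
    X = _with_bias(states)
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w -= lr * X.T @ (p - labels) / len(labels)
    return w


def predict_unsafe_prob(w, states):
    """Estimated probability that each state is unsafe."""
    return 1.0 / (1.0 + np.exp(-_with_bias(states) @ w))


class NoveltyFilter:
    """Hypothetical novelty test: a trajectory is novel if more than
    `threshold` of the grid cells it visits have not been seen before."""

    def __init__(self, cell_size=0.5, threshold=0.2):
        self.cell_size = cell_size
        self.threshold = threshold
        self.seen = set()

    def _cells(self, traj):
        grid = np.floor(np.asarray(traj) / self.cell_size).astype(int)
        return {tuple(cell) for cell in grid.tolist()}

    def is_novel(self, traj):
        cells = self._cells(traj)
        return len(cells - self.seen) / len(cells) > self.threshold

    def add(self, traj):
        # Record the cells of a trajectory that has been shown to the evaluator.
        self.seen |= self._cells(traj)


# Toy 1-D example: the evaluator marks the second trajectory as unsafe.
safe_traj = np.array([[-2.0], [-1.0], [-0.5]])
unsafe_traj = np.array([[-1.0], [0.5], [2.0]])  # the state -1.0 gets a noisy label

novelty = NoveltyFilter()
queried, feedback = [], []
for traj, label in [(safe_traj, 0), (unsafe_traj, 1), (safe_traj, 0)]:
    if novelty.is_novel(traj):  # only novel trajectories reach the evaluator
        queried.append(traj)
        feedback.append(label)
        novelty.add(traj)

states, labels = propagate_labels(queried, feedback)
weights = fit_cost_classifier(states, labels)
```

In this toy run the third trajectory repeats the first, so it is never shown to the evaluator, which is the kind of saving the sampling scheme is meant to deliver. The classifier still separates large positive states from large negative ones even though one state carries conflicting labels.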
