Generating complex behaviors that satisfy the preferences of non-expert users is a crucial requirement for AI agents. Interactive reward learning from trajectory comparisons is one way to let non-expert users convey complex objectives by expressing preferences over short clips of agent behavior. Although this parametric method can encode complex tacit knowledge present in the underlying tasks, it implicitly assumes that the human cannot provide richer feedback than binary preference labels, which leads to intolerably high feedback complexity and a poor user experience. While providing a detailed symbolic closed-form specification of the objectives might be tempting, it is not always feasible, even for an expert user. However, in most cases, humans know how the agent should change its behavior along meaningful axes to fulfill their underlying purpose, even if they cannot fully specify the task objectives symbolically. With this as motivation, we introduce the notion of Relative Behavioral Attributes, which allows users to tweak the agent's behavior through symbolic concepts (e.g., increasing the softness or speed of the agent's movement). We propose two practical methods that can learn to model any kind of behavioral attribute from ordered behavior clips. We demonstrate the effectiveness of our methods on four tasks with nine different behavioral attributes, showing that once the attributes are learned, end users can produce desirable agent behaviors with relatively little effort, by providing feedback only around ten times. This is over an order of magnitude less feedback than that required by the popular learning-from-human-preferences baselines. The supplementary video and source code are available at: https://guansuns.github.io/pages/rba.