A key challenge in reward learning from human input is that desired agent behavior often changes based on context. For example, a robot must adapt to avoid a stove once it becomes hot. We observe that while high-level preferences (e.g., prioritizing safety over efficiency) often remain constant, context alters the $\textit{saliency}$--or importance--of reward features. For instance, stove heat changes the relevance of the robot's proximity, not the underlying preference for safety. Moreover, these contextual effects recur across tasks, motivating the need for transferable representations to encode them. Existing multi-task and meta-learning methods simultaneously learn representations and task preferences, at best $\textit{implicitly}$ capturing contextual effects and requiring substantial data to separate them from task-specific preferences. Instead, we propose $\textit{explicitly}$ modeling and learning context-dependent feature saliency separately from context-invariant preferences. We introduce $\textit{calibrated features}$--modular representations that capture contextual effects on feature saliency--and present specialized paired comparison queries that isolate saliency from preference for efficient learning. Simulated experiments show our method improves sample efficiency, requiring 10x fewer preference queries than baselines to achieve equivalent reward accuracy, with up to 15% better performance in low-data regimes (5-10 queries). An in-person user study (N=12) demonstrates that participants can effectively teach their personal contextual preferences with our method, enabling adaptable and personalized reward learning.
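To make the decomposition concrete, here is a minimal illustrative sketch (the notation below is assumed for exposition and is not taken from the paper): with context-invariant preference weights $\theta_i$, context-dependent saliency calibrations $\sigma_i(c)$, and reward features $\phi_i(\xi)$ of a trajectory $\xi$, a calibrated reward could take the form
$$R(\xi, c) \;=\; \sum_{i} \theta_i \, \sigma_i(c)\, \phi_i(\xi),$$
so that paired comparison queries which vary the context $c$ while holding the preference weights $\theta_i$ fixed could isolate the saliency terms $\sigma_i(c)$ from the underlying preferences.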