As a robot's operational environment and the tasks it must perform within it grow in complexity, explicitly specifying and balancing optimization objectives to achieve a preferred behavior profile moves increasingly out of reach. These systems benefit strongly from being able to align their behavior with human preferences and respond to corrections, but manually encoding this feedback is infeasible. Active preference learning (APL) learns human reward functions by presenting trajectories for ranking. However, existing methods sample from fixed trajectory sets or replay buffers, which limits query diversity and often fails to identify informative comparisons. We propose CRED, a novel trajectory generation method for APL that improves reward inference by jointly optimizing environment design and trajectory selection to efficiently query and extract preferences from users. CRED "imagines" new scenarios through environment design and leverages counterfactual reasoning -- by sampling possible rewards from its current belief and asking "What if this were the true preference?" -- to generate trajectory pairs that expose differences between competing reward functions. Comprehensive experiments and a user study show that CRED significantly outperforms state-of-the-art methods in reward accuracy and sample efficiency and receives higher user ratings.
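To make the counterfactual query-generation idea concrete, the sketch below illustrates one plausible reading of it: sample candidate reward hypotheses from the current belief, "imagine" candidate environments, and select the trajectory pair on which the sampled hypotheses disagree most. This is not the authors' implementation; every name (FEATURE_DIM, sample_belief, imagine_environment, rollout, generate_query) and the Gaussian belief, linear-reward, and disagreement-score choices are hypothetical assumptions for illustration only.

```python
# Hedged sketch of counterfactual query generation for active preference learning.
# Assumptions (not from the paper): linear rewards over trajectory features,
# a Gaussian belief over reward weights, and random stand-ins for environment
# design and trajectory generation.
import numpy as np

FEATURE_DIM = 4          # hypothetical dimensionality of trajectory features
N_CANDIDATE_ENVS = 8     # hypothetical number of "imagined" environment designs
N_TRAJ_PER_ENV = 16      # hypothetical number of candidate trajectories per env

rng = np.random.default_rng(0)

def sample_belief(n: int) -> np.ndarray:
    """Sample n reward-weight vectors from the current (here: Gaussian) belief."""
    return rng.normal(size=(n, FEATURE_DIM))

def imagine_environment() -> np.ndarray:
    """Stand-in for environment design: here just a random feature scaling."""
    return rng.uniform(0.5, 1.5, size=FEATURE_DIM)

def rollout(env: np.ndarray, n: int) -> np.ndarray:
    """Stand-in for trajectory generation: n feature vectors for this environment."""
    return rng.normal(size=(n, FEATURE_DIM)) * env

def generate_query():
    """Return the (env, traj_a, traj_b) whose comparison best separates two
    reward hypotheses sampled from the belief ("what if this were the true
    preference?")."""
    w1, w2 = sample_belief(2)
    best, best_gap = None, -np.inf
    for _ in range(N_CANDIDATE_ENVS):
        env = imagine_environment()
        trajs = rollout(env, N_TRAJ_PER_ENV)
        # Preferred trajectory under each counterfactual reward hypothesis.
        a = trajs[np.argmax(trajs @ w1)]
        b = trajs[np.argmax(trajs @ w2)]
        # Disagreement score: how strongly the two hypotheses rank (a, b) differently.
        gap = (a - b) @ w1 + (b - a) @ w2
        if gap > best_gap:
            best, best_gap = (env, a, b), gap
    return best

env, traj_a, traj_b = generate_query()
print("imagined env scaling:", np.round(env, 2))
print("query pair features:", np.round(traj_a, 2), np.round(traj_b, 2))
```

Under these assumptions, a pair that the two sampled hypotheses rank oppositely is exactly the kind of comparison whose human answer discriminates between competing reward functions.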