Policies often fail due to distribution shift: changes in the state and reward that occur when a policy is deployed in new environments. Data augmentation can increase robustness by making the model invariant to task-irrelevant changes in the agent's observation. However, designers do not know a priori which concepts are irrelevant, especially when different end users have different preferences about how the task is performed. We propose an interactive framework that leverages feedback directly from the user to identify personalized task-irrelevant concepts. Our key idea is to generate counterfactual demonstrations that allow users to quickly identify possible task-relevant and task-irrelevant concepts. The knowledge of task-irrelevant concepts is then used to perform data augmentation and thus obtain a policy adapted to personalized user objectives. We present experiments validating our framework on discrete and continuous control tasks with real human users. Our method (1) enables users to better understand agent failure, (2) reduces the number of demonstrations required for fine-tuning, and (3) aligns the agent to individual user task preferences.
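As a rough illustration of the augmentation step described above, the following minimal sketch resamples user-identified task-irrelevant concepts in a set of demonstrations while leaving the actions untouched. It is not the authors' implementation. The function name `augment_demos`, the dictionary-based observation format, and the concept names (`floor_color`, `agent`, `goal`) are all hypothetical and chosen only for illustration.

```python
import random
from typing import Any, Dict, List, Sequence, Tuple

# Assumption: an observation is a dict mapping concept names to values,
# and a demonstration is a list of (observation, action) pairs.
Observation = Dict[str, Any]
Demo = List[Tuple[Observation, str]]


def augment_demos(
    demos: Sequence[Demo],
    irrelevant_values: Dict[str, Sequence[Any]],
    copies_per_demo: int = 3,
    seed: int = 0,
) -> List[Demo]:
    """Return the original demos plus copies whose task-irrelevant
    concepts (hypothetically flagged by the user) are resampled."""
    rng = random.Random(seed)
    augmented: List[Demo] = [list(demo) for demo in demos]  # keep originals
    for demo in demos:
        for _ in range(copies_per_demo):
            # Sample one value per irrelevant concept and hold it fixed
            # for the whole trajectory so each copy stays self-consistent.
            sampled = {
                concept: rng.choice(list(values))
                for concept, values in irrelevant_values.items()
            }
            # Overwrite only the irrelevant concepts; actions are unchanged.
            new_demo = [({**obs, **sampled}, action) for obs, action in demo]
            augmented.append(new_demo)
    return augmented


if __name__ == "__main__":
    demo = [
        ({"agent": (0, 0), "goal": (1, 1), "floor_color": "grey"}, "right"),
        ({"agent": (1, 0), "goal": (1, 1), "floor_color": "grey"}, "down"),
    ]
    # Suppose the user marked floor colour as irrelevant to the task.
    out = augment_demos([demo], {"floor_color": ["red", "blue"]}, copies_per_demo=2)
    print(len(out))
```

Training a policy on the enlarged set would encourage it to ignore the resampled concepts. Which concepts are safe to resample is exactly what the user feedback is meant to determine, so the `irrelevant_values` argument stands in for the output of that interactive step.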