For autonomous agents to successfully integrate into human-centered environments, they must be able to learn from and adapt to humans in their native settings. Preference-based reinforcement learning (PbRL) is a promising approach that learns reward functions from human preferences, enabling RL agents to adapt their behavior to human desires. However, humans live in a world full of diverse information, most of which is irrelevant to completing any particular task, so it is essential that agents learn to focus on the subset of task-relevant environment features. Unfortunately, prior work has largely ignored this aspect, focusing primarily on improving PbRL algorithms in standard RL environments that are carefully constructed to contain only task-relevant features. The resulting algorithms may not transfer effectively to noisier real-world settings. To that end, this work proposes R2N (Robust-to-Noise), the first PbRL algorithm that leverages principles of dynamic sparse training to learn robust reward models that focus on task-relevant features. We study the effectiveness of R2N in the Extremely Noisy Environment setting, an RL problem setting in which up to 95% of the state features are irrelevant distractions. In experiments with a simulated teacher, we demonstrate that R2N adapts the sparse connectivity of its neural networks to focus on task-relevant features, enabling it to significantly outperform several state-of-the-art PbRL algorithms in multiple locomotion and control environments.