Preference-based reinforcement learning (PbRL) shows promise in aligning robot behaviors with human preferences, but its success depends heavily on accurately modeling human preferences through reward models. Most methods adopt Markovian assumptions for preference modeling (PM), which overlook the temporal dependencies within robot behavior trajectories that impact human evaluations. While recent works have utilized sequence modeling to mitigate this by learning sequential non-Markovian rewards, they ignore the multimodal nature of robot trajectories, which consist of elements from two distinct modalities: state and action. As a result, they often struggle to capture the complex interplay between these modalities that significantly shapes human preferences. In this paper, we propose a multimodal sequence modeling approach for PM by disentangling state and action modalities. We introduce a multimodal transformer network, named PrefMMT, which hierarchically leverages intra-modal temporal dependencies and inter-modal state-action interactions to capture complex preference patterns. We demonstrate that PrefMMT consistently outperforms state-of-the-art PM baselines on locomotion tasks from the D4RL benchmark and manipulation tasks from the Meta-World benchmark.
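The hierarchical idea described above can be sketched at a high level: attend within each modality first (intra-modal temporal dependencies), then across modalities (inter-modal state-action interactions), and score trajectory pairs with a Bradley-Terry preference model, as is standard in PbRL. This is not the paper's implementation; it is a minimal numpy sketch with single-head unlearned attention, toy projection matrices `W_s`/`W_a`, and a uniform readout head, all of which are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention: (T_q, d) queries over (T_k, d) keys.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def trajectory_reward(states, actions, W_s, W_a):
    # Intra-modal stage: self-attention within each modality captures
    # temporal dependencies among states and among actions separately.
    s = states @ W_s
    a = actions @ W_a
    s = attention(s, s, s)
    a = attention(a, a, a)
    # Inter-modal stage: cross-attention lets state tokens attend to
    # action tokens and vice versa, modeling state-action interplay.
    s2 = attention(s, a, a)
    a2 = attention(a, s, s)
    # Pool over time and project to a scalar non-Markovian reward.
    feat = np.concatenate([s2.mean(axis=0), a2.mean(axis=0)])
    w_out = np.ones(feat.shape[0]) / feat.shape[0]  # toy readout head
    return float(feat @ w_out)

def preference_prob(traj_a, traj_b, W_s, W_a):
    # Bradley-Terry model: P(traj_a preferred) from the reward difference.
    r_a = trajectory_reward(*traj_a, W_s, W_a)
    r_b = trajectory_reward(*traj_b, W_s, W_a)
    return 1.0 / (1.0 + np.exp(r_b - r_a))

# Toy dimensions: 4-dim states, 2-dim actions, 8-dim embeddings, length 10.
d_s, d_a, d, T = 4, 2, 8, 10
W_s = rng.normal(size=(d_s, d)) * 0.5
W_a = rng.normal(size=(d_a, d)) * 0.5
traj_a = (rng.normal(size=(T, d_s)), rng.normal(size=(T, d_a)))
traj_b = (rng.normal(size=(T, d_s)), rng.normal(size=(T, d_a)))
p = preference_prob(traj_a, traj_b, W_s, W_a)
```

In a trained model the attention weights and readout would be learned by minimizing the cross-entropy between `preference_prob` and human preference labels; the sketch only shows how disentangled intra- and inter-modal attention compose into a scalar trajectory reward.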