Offline reinforcement learning (RL) derives policies from static trajectory data without real-time environment interaction. Recent studies have shown that offline RL can be framed as a sequence modeling task in which a transformer's sole objective is to predict actions from the preceding context. This single-task formulation, however, can undermine the transformer's attention mechanism, which should ideally assign varying attention weights to different tokens in the input context for optimal prediction. To address this, we reformulate offline RL as a multi-objective optimization problem in which prediction is extended to states and returns. We also highlight a potential flaw in the trajectory representation used for sequence modeling: the action distribution within a trajectory, which is dictated by the behavioral policy, is non-smooth, and this can introduce inaccuracies when modeling the state and return distributions. To mitigate this issue, we add action space regions to the trajectory representation. Our experiments on D4RL benchmark locomotion tasks show that these proposals allow more effective use of the transformer's attention mechanism, yielding performance that matches or outperforms current state-of-the-art methods.
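As a rough illustration of the two ideas above, the following sketch combines per-target prediction losses (actions, states, returns) into one scalar objective and maps a continuous action to coarse region indices. The abstract does not specify the loss form, the weighting, or how regions are constructed, so everything here is an assumption made for illustration: the function names `multi_objective_loss` and `action_region`, the use of mean squared error, the weighted-sum scalarization, and uniform per-dimension binning over an assumed action range of [-1, 1].

```python
from typing import Dict, List, Optional, Sequence


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def multi_objective_loss(
    pred: Dict[str, Sequence[float]],
    target: Dict[str, Sequence[float]],
    weights: Optional[Dict[str, float]] = None,
) -> float:
    """Weighted sum of per-target losses (hypothetical scalarization).

    `pred` and `target` map a target name such as "action", "state",
    or "return" to a vector. Equal weights are used when none are given.
    """
    if weights is None:
        weights = {name: 1.0 for name in pred}
    return sum(weights[name] * mse(pred[name], target[name]) for name in pred)


def action_region(
    action: Sequence[float],
    low: float = -1.0,
    high: float = 1.0,
    bins: int = 4,
) -> List[int]:
    """Map each action dimension to a coarse bin index in [0, bins - 1].

    Uniform binning is an assumption; the paper may define regions differently.
    """
    regions = []
    for a in action:
        clipped = min(max(a, low), high)  # keep the value inside [low, high]
        idx = int((clipped - low) / (high - low) * bins)
        regions.append(min(idx, bins - 1))  # the upper bound falls in the last bin
    return regions


if __name__ == "__main__":
    pred = {"action": [0.0, 0.0], "state": [1.0], "return": [2.0]}
    target = {"action": [1.0, 1.0], "state": [1.0], "return": [0.0]}
    print(multi_objective_loss(pred, target))  # 5.0
    print(action_region([-1.0, 0.0, 1.0]))     # [0, 2, 3]
```

In this sketch the region indices would act as an additional, smoother token alongside the raw action in each timestep of the trajectory; where exactly such a token sits in the sequence is not stated in the abstract.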