Offline reinforcement learning aims to learn effective agent policies from offline datasets without online interaction, imposing conservative constraints supported by the behavior policies to tackle the out-of-distribution problem. However, existing works often suffer from a constraint-conflict issue when the offline dataset is collected from multiple behavior policies, i.e., different behavior policies may exhibit inconsistent actions with distinct returns across the state space. To remedy this issue, recent advantage-weighted methods prioritize samples with high advantage values for agent training, but inevitably ignore the diversity of the behavior policies. In this paper, we introduce a novel Advantage-Aware Policy Optimization (A2PO) method that explicitly constructs advantage-aware policy constraints for offline learning on mixed-quality datasets. Specifically, A2PO employs a conditional variational auto-encoder to disentangle the action distributions of the intertwined behavior policies by modeling the advantage values of all training data as conditional variables. The agent can then follow such disentangled action-distribution constraints to optimize the advantage-aware policy toward high advantage values. Extensive experiments on both single-quality and mixed-quality datasets from the D4RL benchmark demonstrate that A2PO yields superior results to its counterparts. Our code is available at https://github.com/Plankson/A2PO
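As a rough illustration of the disentangling mechanism described above, the sketch below shows a conditional variational auto-encoder whose encoder and decoder both take a per-sample advantage value as an extra conditional input, so that actions from behavior policies of different quality are mapped to advantage-conditioned distributions. All layer sizes and names are hypothetical choices for the sketch, not the paper's actual architecture:

```python
import torch
import torch.nn as nn


class AdvantageConditionedCVAE(nn.Module):
    """Minimal CVAE sketch: condition on state and a scalar advantage value.

    Hypothetical architecture; the paper's implementation may differ.
    """

    def __init__(self, state_dim, action_dim, latent_dim=8, hidden=64):
        super().__init__()
        # Encoder q(z | s, a, xi): xi is the advantage condition.
        self.encoder = nn.Sequential(
            nn.Linear(state_dim + action_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * latent_dim),  # outputs mean and log-variance
        )
        # Decoder p(a | s, z, xi): reconstructs the action under the condition.
        self.decoder = nn.Sequential(
            nn.Linear(state_dim + latent_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, state, action, adv):
        h = self.encoder(torch.cat([state, action, adv], dim=-1))
        mu, log_var = h.chunk(2, dim=-1)
        # Reparameterization trick: z = mu + sigma * eps.
        z = mu + torch.randn_like(mu) * (0.5 * log_var).exp()
        recon = self.decoder(torch.cat([state, z, adv], dim=-1))
        return recon, mu, log_var


def cvae_loss(recon, action, mu, log_var):
    # Reconstruction term plus KL(q(z | s, a, xi) || N(0, I)).
    recon_loss = ((recon - action) ** 2).sum(-1).mean()
    kl = -0.5 * (1 + log_var - mu.pow(2) - log_var.exp()).sum(-1).mean()
    return recon_loss + kl
```

At training time each transition would be paired with its estimated advantage; at policy-optimization time, decoding with a high advantage condition yields the action-distribution constraint biased toward high-return behavior.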