Offline reinforcement learning (RL) methods strike a balance between exploration and exploitation through conservative value estimation, that is, by penalizing the values of unseen states and actions. Model-free methods penalize values at all unseen actions, while model-based methods can further exploit unseen states via model rollouts. However, such methods are limited in their ability to find unseen states far from the available offline data, for two reasons: (a) rollout horizons are kept very short because model errors cascade, and (b) model rollouts originate solely from states observed in the offline data. We relax the second restriction and present a novel unseen-state augmentation strategy that allows exploitation of unseen states where the learned model and value estimates generalize. Our strategy finds unseen states through value-informed perturbations of seen states, then filters out states whose epistemic uncertainty estimates are too high (high error) or too low (too similar to seen data). We observe improved performance on several offline RL tasks and find that our augmentation strategy consistently leads to lower average dataset Q-value estimates, i.e., more conservative Q-value estimates than a baseline.
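The perturb-then-filter idea can be illustrated with a minimal NumPy sketch. Everything here is an assumption made for illustration and is not taken from the paper: the function names, the normalized gradient step used as the "value-informed" perturbation, ensemble disagreement as the epistemic uncertainty proxy, and the quantile thresholds are all hypothetical choices.

```python
import numpy as np


def perturb_states(states, value_grad_fn, step_size=0.1):
    """Move each seen state one normalized step along a value gradient.

    Assumption: the perturbation rule (a unit-norm gradient step) is a
    stand-in; the abstract only says perturbations are value-informed.
    """
    grads = value_grad_fn(states)
    norms = np.linalg.norm(grads, axis=1, keepdims=True)
    return states + step_size * grads / np.maximum(norms, 1e-8)


def ensemble_uncertainty(states, ensemble):
    """Uncertainty proxy: std of predictions across ensemble members,
    averaged over output dimensions (an assumed estimator)."""
    preds = np.stack([model(states) for model in ensemble])
    return preds.std(axis=0).mean(axis=1)


def filter_by_uncertainty(candidates, uncertainty, low, high):
    """Keep candidates that are neither too uncertain (likely high model
    error) nor too certain (too similar to seen data)."""
    keep = (uncertainty >= low) & (uncertainty <= high)
    return candidates[keep]


# Toy usage with hypothetical linear models.
rng = np.random.default_rng(0)
seen = rng.normal(size=(256, 3))

# Toy value function V(s) = w . s, so its gradient is w for every state.
w = np.array([1.0, -0.5, 0.2])
value_grad = lambda s: np.tile(w, (len(s), 1))

# Toy dynamics ensemble: each member is a slightly different linear map.
ensemble = [
    (lambda s, A=np.eye(3) + rng.normal(scale=0.1, size=(3, 3)): s @ A)
    for _ in range(5)
]

candidates = perturb_states(seen, value_grad, step_size=0.5)
u = ensemble_uncertainty(candidates, ensemble)
low, high = np.quantile(u, [0.2, 0.8])  # assumed thresholds
augmented = filter_by_uncertainty(candidates, u, low, high)
```

In this sketch, `augmented` would play the role of additional rollout start states alongside those in the offline dataset; how the thresholds are chosen in practice is left open here.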