Offline reinforcement learning (RL) methods strike a balance between exploration and exploitation through conservative value estimation, i.e., by penalizing the values of unseen states and actions. Model-free methods penalize values at all unseen actions, while model-based methods can further exploit unseen states via model rollouts. However, such methods are limited in their ability to find unseen states far from the available offline data, for two reasons: (a) rollout horizons are kept very short because model errors cascade, and (b) model rollouts originate solely from states observed in the offline data. We relax the second restriction and present a novel unseen state augmentation strategy that allows exploitation of unseen states where the learned model and value estimates generalize. Our strategy finds unseen states by applying value-informed perturbations to seen states, then filtering out states whose epistemic uncertainty estimates are too high (high error) or too low (too similar to seen data). We observe improved performance on several offline RL tasks and find that our augmentation strategy consistently leads to lower average dataset Q-value estimates, i.e., more conservative Q-value estimates than a baseline.
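As a rough illustration of the perturb-then-filter idea described in the abstract, the following minimal sketch may help. It is not the authors' implementation. It assumes that the value-informed perturbation is a single gradient step that increases the value estimate (computed here by finite differences), and that epistemic uncertainty is approximated by the disagreement of a model ensemble. The names `value_gradient`, `ensemble_disagreement`, `augment_states`, `step_size`, `u_low`, and `u_high` are all hypothetical.

```python
import statistics


def value_gradient(value_fn, state, eps=1e-4):
    """Central finite-difference gradient of value_fn at state (a list of floats)."""
    grad = []
    for i in range(len(state)):
        up = list(state)
        down = list(state)
        up[i] += eps
        down[i] -= eps
        grad.append((value_fn(up) - value_fn(down)) / (2 * eps))
    return grad


def ensemble_disagreement(ensemble, state):
    """Assumed uncertainty proxy: population std. dev. of the ensemble's scalar predictions."""
    predictions = [model(state) for model in ensemble]
    return statistics.pstdev(predictions)


def augment_states(seen_states, value_fn, ensemble,
                   step_size=0.5, u_low=0.01, u_high=0.5):
    """Perturb each seen state along the value gradient, then keep only
    candidates whose uncertainty lies inside [u_low, u_high].

    Candidates above u_high are dropped as likely high-error; candidates
    below u_low are dropped as too similar to data already seen.
    """
    kept = []
    for state in seen_states:
        grad = value_gradient(value_fn, state)
        # Assumption: one ascent step on the value estimate.
        candidate = [x + step_size * g for x, g in zip(state, grad)]
        uncertainty = ensemble_disagreement(ensemble, candidate)
        if u_low <= uncertainty <= u_high:
            kept.append(candidate)
    return kept
```

In a model-based pipeline, the states returned by `augment_states` would serve as additional starting points for model rollouts, alongside the states in the offline dataset. The thresholds `u_low` and `u_high` are placeholders and would need tuning per task.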