Expressive policies based on flow matching have recently been applied successfully in reinforcement learning (RL) because of their ability to model complex action distributions from offline data. These algorithms build on standard policy gradients, which assume there is no unmeasured confounding in the data. However, this condition does not necessarily hold for pixel-based demonstrations: when a mismatch exists between the demonstrator's and the learner's sensory capabilities, the offline data carry implicit confounding biases. We address this challenge by studying the problem of confounded observations in offline RL from a causal perspective. We develop a novel causal offline RL objective that optimizes a policy's worst-case performance under the confounding biases the data may contain. Building on this objective, we introduce a practical implementation that learns expressive flow-matching policies from confounded demonstrations, employing a deep discriminator to assess the discrepancy between the target policy and the nominal behavioral policy. Experiments across 25 pixel-based tasks demonstrate that our confounding-robust augmentation procedure achieves 120\% of the success rate of confounding-unaware, state-of-the-art offline RL methods.
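To make the pipeline the abstract names more concrete, the following is a minimal PyTorch sketch of its two stated ingredients: a conditional flow-matching policy trained on demonstrations, and a deep discriminator scoring how closely sampled actions track the nominal behavioral policy. All dimensions, architectures, the penalty weight `lam`, and the single-step training loop are illustrative assumptions, not the paper's implementation or its worst-case objective.

```python
# Sketch (assumed setup): flow-matching policy + behavioral discriminator.
import torch
import torch.nn as nn
import torch.nn.functional as F

OBS_DIM, ACT_DIM = 64, 6  # assumed encoder-feature and action sizes

class VelocityNet(nn.Module):
    """Predicts the flow-matching velocity v(x_t, t | obs)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(OBS_DIM + ACT_DIM + 1, 256), nn.SiLU(),
            nn.Linear(256, 256), nn.SiLU(),
            nn.Linear(256, ACT_DIM))
    def forward(self, obs, x_t, t):
        return self.net(torch.cat([obs, x_t, t], dim=-1))

class Discriminator(nn.Module):
    """Scores (obs, action) pairs; high = dataset-like (behavioral)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(OBS_DIM + ACT_DIM, 256), nn.SiLU(),
            nn.Linear(256, 1))
    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1))

def flow_matching_loss(v_net, obs, act):
    # Linear path x_t = (1 - t) * noise + t * action has ground-truth
    # velocity (action - noise); regress the network onto it.
    noise = torch.randn_like(act)
    t = torch.rand(act.shape[0], 1)
    x_t = (1 - t) * noise + t * act
    return ((v_net(obs, x_t, t) - (act - noise)) ** 2).mean()

def sample_actions(v_net, obs, steps=8):
    # Integrate the learned ODE from noise to an action (Euler steps).
    x = torch.randn(obs.shape[0], ACT_DIM)
    for i in range(steps):
        t = torch.full((obs.shape[0], 1), i / steps)
        x = x + (1.0 / steps) * v_net(obs, x, t)
    return x

v_net, disc = VelocityNet(), Discriminator()
pi_opt = torch.optim.Adam(v_net.parameters(), lr=3e-4)
d_opt = torch.optim.Adam(disc.parameters(), lr=3e-4)
lam = 1.0  # assumed weight on the discrepancy penalty

obs = torch.randn(32, OBS_DIM)  # stand-in batch of confounded demos
act = torch.randn(32, ACT_DIM)

# Policy step: imitate the demonstrations, but penalize actions the
# discriminator flags as deviating from the behavioral policy.
pi_opt.zero_grad()
penalty = -disc(obs, sample_actions(v_net, obs)).mean()
(flow_matching_loss(v_net, obs, act) + lam * penalty).backward()
pi_opt.step()

# Discriminator step: logistic loss separating dataset actions from
# (detached) policy samples.
d_opt.zero_grad()
fake = sample_actions(v_net, obs).detach()
d_loss = (F.softplus(-disc(obs, act)) + F.softplus(disc(obs, fake))).mean()
d_loss.backward()
d_opt.step()
```

In a full implementation, `obs` would come from a pixel encoder and the discriminator term would enter a worst-case (min-max) objective rather than a fixed-weight regularizer; the sketch only fixes the interfaces between the moving parts.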