Deep reinforcement learning (DRL) requires collecting interventional data, which can be expensive or even unethical in real-world domains such as autonomous driving and medicine. Offline reinforcement learning promises to alleviate this issue by exploiting the vast amount of observational data available in the real world. However, observational data may mislead the learning agent toward undesirable outcomes if the behavior policy that generated the data depends on unobserved random variables (i.e., confounders). In this paper, we propose two deconfounding methods for DRL to address this problem. The methods first compute the importance of each sample using causal inference techniques, and then adjust each sample's contribution to the loss function by reweighting or resampling the offline dataset to ensure unbiasedness. These deconfounding methods can be flexibly combined with existing model-free DRL algorithms such as soft actor-critic and deep Q-learning, provided that the loss functions of these algorithms satisfy a weak condition. We prove the effectiveness of our deconfounding methods and validate them experimentally.
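The reweighting and resampling steps described above can be illustrated with a minimal sketch. It assumes the loss is an average of per-sample terms, which is an assumption made here for illustration and not necessarily the weak condition the paper states. The `losses` and `weights` arrays are hypothetical placeholders: in the paper the weights come from a causal inference procedure that this sketch does not reproduce. The function names `reweighted_loss` and `resample_indices` are likewise invented for this example.

```python
import numpy as np

def reweighted_loss(per_sample_loss, weights):
    """Self-normalized weighted average of per-sample losses."""
    l = np.asarray(per_sample_loss, dtype=float)
    w = np.asarray(weights, dtype=float)
    return float(np.sum(w * l) / np.sum(w))

def resample_indices(weights, size, rng):
    """Draw dataset indices with probability proportional to the weights."""
    w = np.asarray(weights, dtype=float)
    p = w / w.sum()
    return rng.choice(len(w), size=size, replace=True, p=p)

rng = np.random.default_rng(0)

# Hypothetical per-sample losses and importance weights (placeholders only;
# the paper derives its weights from causal inference, not shown here).
losses = np.array([1.0, 2.0, 3.0, 4.0])
weights = np.array([0.5, 0.5, 2.0, 1.0])

# Option 1: keep the dataset as is and weight each sample's loss term.
loss_reweighted = reweighted_loss(losses, weights)

# Option 2: resample the dataset by weight, then take a plain average.
idx = resample_indices(weights, size=200_000, rng=rng)
loss_resampled = float(losses[idx].mean())

print(loss_reweighted, loss_resampled)
```

Under these assumptions the two options target the same quantity: the reweighted loss is computed exactly, while the resampled loss is a Monte Carlo estimate of it that gets closer as more samples are drawn. Resampling leaves the downstream loss code unchanged, whereas reweighting avoids the extra sampling noise.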