The paradigm of decision-making has been revolutionised by reinforcement learning (RL) and deep learning. Although this has led to significant progress in domains such as robotics, healthcare, and finance, the use of RL in practice is challenging, particularly when learning decision policies in high-stakes applications that may require guarantees. Traditional RL algorithms rely on a large number of online interactions with the environment, which is problematic in scenarios where online interactions are costly, dangerous, or infeasible. Learning from offline datasets is an attractive alternative, but it is hindered by the presence of hidden confounders. Such confounders can cause spurious correlations in the dataset and can mislead the agent into taking suboptimal or adversarial actions. Firstly, we address the problem of learning from offline datasets in the presence of hidden confounders. We work with instrumental variables (IVs) to identify the causal effect, which is an instance of a conditional moment restriction (CMR) problem. Inspired by double/debiased machine learning, we derive a sample-efficient algorithm for solving CMR problems with convergence and optimality guarantees, which outperforms state-of-the-art algorithms. Secondly, we relax the conditions on the hidden confounders in the setting of (offline) imitation learning, and adapt our CMR estimator to derive an algorithm that can learn effective imitator policies with convergence rate guarantees. Finally, we consider the problem of learning high-level objectives expressed in linear temporal logic (LTL) and develop a provably optimal learning algorithm that improves sample efficiency over existing methods. Through evaluation on reinforcement learning benchmarks and synthetic and semi-synthetic datasets, we demonstrate the usefulness of the methods developed in this thesis in real-world decision-making.