In this article, we propose a novel pessimism-based Bayesian learning method for optimal dynamic treatment regimes in the offline setting. When the coverage condition does not hold, which is common for offline data, existing solutions can produce sub-optimal policies. The pessimism principle addresses this issue by discouraging the recommendation of actions that are less explored conditional on the state. However, nearly all pessimism-based methods rely on a key hyper-parameter that quantifies the degree of pessimism, and their performance can be highly sensitive to the choice of this parameter. We propose to integrate the pessimism principle with Thompson sampling and Bayesian machine learning to optimize the degree of pessimism. We derive a credible set whose boundary uniformly lower bounds the optimal Q-function, and thus we do not require additional tuning of the degree of pessimism. We develop a general Bayesian learning method that works with a range of models, from Bayesian linear basis models to Bayesian neural network models. We develop a computational algorithm based on variational inference, which is highly efficient and scalable. We establish theoretical guarantees for the proposed method, and show empirically, through both simulations and a real data example, that it outperforms existing state-of-the-art solutions.
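To make the pessimism principle concrete, the following is a minimal single-stage sketch and not the proposed method. It assumes a conjugate Bayesian linear model with known noise variance for each action, and it uses a fixed pointwise quantile `z` in place of the uniform credible set, Thompson sampling, and variational inference described above. All names (`fit_posterior`, `lower_bound`, `choose_action`, `simulate`) and the toy data are hypothetical. Note that `z` is precisely the kind of hand-set degree of pessimism the abstract aims to avoid tuning.

```python
import numpy as np

def fit_posterior(X, y, prior_var=10.0, noise_var=1.0):
    """Gaussian posterior over weights w for y = X @ w + noise (known noise variance)."""
    d = X.shape[1]
    precision = np.eye(d) / prior_var + X.T @ X / noise_var
    cov = np.linalg.inv(precision)
    mean = cov @ X.T @ y / noise_var
    return mean, cov

def lower_bound(x, mean, cov, z=1.645):
    """Pointwise lower credible bound on the Q-value x @ w (z is an assumed quantile)."""
    return float(x @ mean - z * np.sqrt(x @ cov @ x))

def choose_action(x, posteriors, z=1.645):
    """Pick the action with the largest lower bound; z=0 recovers the greedy choice."""
    bounds = [lower_bound(x, m, c, z) for m, c in posteriors]
    return int(np.argmax(bounds))

# Toy offline data (assumed): action 0 is well covered, action 1 is barely explored.
rng = np.random.default_rng(0)

def simulate(n, true_w):
    s = rng.uniform(-1.0, 1.0, size=n)
    X = np.column_stack([np.ones(n), s])      # features: intercept and state
    y = X @ true_w + rng.normal(0.0, 1.0, size=n)
    return X, y

posteriors = [
    fit_posterior(*simulate(200, np.array([1.0, 0.5]))),  # action 0: 200 samples
    fit_posterior(*simulate(3, np.array([0.8, 0.5]))),    # action 1: 3 samples
]

x = np.array([1.0, 0.5])  # features of the state at which we recommend an action
for a, (m, c) in enumerate(posteriors):
    print(f"action {a}: posterior mean {x @ m:.3f}, lower bound {lower_bound(x, m, c):.3f}")
print("greedy choice:", choose_action(x, posteriors, z=0.0))
print("pessimistic choice:", choose_action(x, posteriors))
```

Because the posterior variance of the poorly covered action is much larger, its lower bound is penalized more heavily, so the pessimistic rule may reject an action that the greedy rule would pick on the basis of a noisy posterior mean.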