Offline reinforcement learning (RL) aims to learn policies that maximize the expected total reward in a dynamic environment by leveraging pre-collected data. Learning from heterogeneous data is one of the fundamental challenges in offline RL. Traditional methods learn a single optimal policy for all individuals from pre-collected data consisting of a single episode or a homogeneous batch of episodes, and may therefore yield a suboptimal policy for a heterogeneous population. In this paper, we propose an individualized offline policy optimization framework for heterogeneous time-stationary Markov decision processes (MDPs). The proposed heterogeneous model with individual latent variables enables efficient estimation of the individual Q-functions, and our Penalized Pessimistic Personalized Policy Learning (P4L) algorithm guarantees a fast rate of convergence for the average regret under a weak partial coverage assumption on the behavior policies. In addition, simulation studies and a real-data application demonstrate the superior numerical performance of the proposed method over existing methods.
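To make the pessimism principle concrete, the sketch below is a generic count-based pessimistic fitted-Q iteration on tabular offline data. This is a minimal illustration, not the authors' P4L algorithm (which additionally models individual latent variables); the function name, the penalty form `beta / sqrt(count)`, and all parameters are assumptions introduced only for exposition.

```python
import numpy as np

def pessimistic_q_iteration(transitions, n_states, n_actions,
                            gamma=0.95, beta=1.0, n_iters=200):
    """Illustrative pessimistic value iteration on an empirical tabular MDP.

    transitions: iterable of (s, a, r, s_next) tuples from a behavior policy.
    The count-based penalty subtracted from the reward enforces pessimism:
    state-action pairs with little data coverage get conservative Q-values.
    """
    counts = np.zeros((n_states, n_actions))
    reward_sum = np.zeros((n_states, n_actions))
    next_counts = np.zeros((n_states, n_actions, n_states))
    for s, a, r, s_next in transitions:
        counts[s, a] += 1
        reward_sum[s, a] += r
        next_counts[s, a, s_next] += 1

    visited = counts > 0
    # Empirical reward and transition model estimated from the offline data.
    r_hat = np.where(visited, reward_sum / np.maximum(counts, 1), 0.0)
    p_hat = next_counts / np.maximum(counts[..., None], 1)
    # Pessimism bonus (subtracted): large where the data are scarce.
    penalty = beta / np.sqrt(np.maximum(counts, 1))
    penalty[~visited] = beta  # maximal penalty for unseen pairs

    q = np.zeros((n_states, n_actions))
    for _ in range(n_iters):
        v = q.max(axis=1)                    # greedy state values
        q = r_hat - penalty + gamma * p_hat @ v
    return q  # the learned policy is q.argmax(axis=1)
```

Running value iteration on this penalized empirical model yields a conservative Q-function whose greedy policy avoids poorly covered actions; this is, in spirit, the mechanism by which pessimistic methods obtain regret guarantees under partial (rather than full) coverage of the behavior policy.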