Reinforcement learning algorithms typically rely on an interactive simulator (i.e., an environment) with a predefined reward function for policy training. Developing such simulators and manually specifying reward functions, however, is often time-consuming and labor-intensive. To address this, we propose the Offline Simulator (OffSim), a novel model-based offline inverse reinforcement learning (IRL) framework that emulates environmental dynamics and the reward structure directly from expert-generated state-action trajectories. OffSim jointly optimizes a high-entropy transition model and an IRL-based reward function to enhance exploration and improve the generalizability of the learned reward. Leveraging these learned components, OffSim can subsequently train a policy offline without any further interaction with the real environment. Additionally, we introduce OffSim$^+$, an extension that incorporates a marginal reward for multi-dataset settings to further enhance exploration. Extensive MuJoCo experiments demonstrate that OffSim achieves substantial performance gains over existing offline IRL methods, confirming its efficacy and robustness.
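To make the joint optimization described above concrete, the following is a minimal sketch of one OffSim-style update step: a stochastic transition model is fit to expert transitions with an entropy bonus, while an IRL-style reward is trained to score expert state-action pairs above those produced by the current policy. All module names, network sizes, the specific loss forms, and the entropy coefficient are illustrative assumptions, not the paper's exact formulation.

```python
# Hypothetical sketch of a joint dynamics + reward update in the spirit of OffSim.
# The particular losses and hyperparameters below are assumptions for illustration.
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=256):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, hidden), nn.ReLU(),
                         nn.Linear(hidden, out_dim))

class GaussianDynamics(nn.Module):
    """Stochastic transition model p(s' | s, a); its entropy is explicitly encouraged."""
    def __init__(self, s_dim, a_dim):
        super().__init__()
        self.net = mlp(s_dim + a_dim, 2 * s_dim)

    def dist(self, s, a):
        mu, log_std = self.net(torch.cat([s, a], -1)).chunk(2, -1)
        return torch.distributions.Normal(mu, log_std.clamp(-5, 2).exp())

s_dim, a_dim = 17, 6                      # placeholder MuJoCo-like dimensions
dynamics = GaussianDynamics(s_dim, a_dim)
reward = mlp(s_dim + a_dim, 1)            # learned reward r(s, a)
opt = torch.optim.Adam(list(dynamics.parameters()) + list(reward.parameters()), lr=3e-4)

def offsim_update(expert_batch, policy_batch, entropy_coef=0.01):
    """One joint update: fit the transition model on expert data with an entropy
    bonus, and push the reward to rank expert pairs above policy rollouts."""
    s_e, a_e, s_next_e = expert_batch
    s_p, a_p = policy_batch

    dist = dynamics.dist(s_e, a_e)
    model_loss = -dist.log_prob(s_next_e).sum(-1).mean()   # fit expert transitions
    entropy_bonus = dist.entropy().sum(-1).mean()          # keep the model high-entropy

    # Generic IRL-style contrast (an assumption): expert state-actions should
    # receive higher reward than state-actions sampled from the current policy.
    irl_loss = -(reward(torch.cat([s_e, a_e], -1)).mean()
                 - reward(torch.cat([s_p, a_p], -1)).mean())

    loss = model_loss - entropy_coef * entropy_bonus + irl_loss
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# Smoke test with random tensors standing in for expert and policy batches.
batch = 32
expert = (torch.randn(batch, s_dim), torch.randn(batch, a_dim), torch.randn(batch, s_dim))
policy = (torch.randn(batch, s_dim), torch.randn(batch, a_dim))
print(offsim_update(expert, policy))
```

After this joint phase, the learned dynamics and reward would stand in for the real environment, so the subsequent policy optimization can run entirely offline against synthetic rollouts from the learned model.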