A popular paradigm for offline Reinforcement Learning (RL) tasks is to first fit the offline trajectories to a sequence model, and then prompt the model for actions that lead to high expected return. In addition to obtaining accurate sequence models, this paper highlights that tractability, the ability to answer various probabilistic queries exactly and efficiently, plays an important role in offline RL. Specifically, due to the inherent stochasticity of the offline data-collection policies and the environment dynamics, highly non-trivial conditional/constrained generation is required to elicit rewarding actions. While it is still possible to approximate such queries, we observe that such crude estimates significantly undermine the benefits brought by expressive sequence models. To overcome this problem, this paper proposes Trifle (Tractable Inference for Offline RL), which leverages modern Tractable Probabilistic Models (TPMs) to bridge the gap between good sequence models and high expected returns at evaluation time. Empirically, Trifle achieves state-of-the-art scores on 9 Gym-MuJoCo benchmarks against strong baselines. Further, owing to its tractability, Trifle significantly outperforms prior approaches in stochastic environments and safe RL tasks (e.g. with action constraints) with minimal algorithmic modifications.
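To make the distinction concrete, here is a minimal toy sketch (not the paper's implementation) contrasting the two inference routes the abstract alludes to: an exact conditional query p(a | R = high), which TPMs can answer efficiently, versus a crude rejection-style approximation obtained by sampling from the unconditioned model and filtering. The joint table `p_joint` and all variable names are hypothetical stand-ins for a learned model over actions and binned returns.

```python
import numpy as np

# Hypothetical learned joint p(a, R) over 4 actions and 5 return bins
# at the current state (a toy stand-in for a fitted sequence model).
rng = np.random.default_rng(0)
num_actions, num_return_bins = 4, 5
p_joint = rng.dirichlet(np.ones(num_actions * num_return_bins)).reshape(
    num_actions, num_return_bins
)

high_return = num_return_bins - 1  # condition on the top return bin

# (a) Exact conditional query p(a | R = high): trivial for this toy table,
# and supported exactly and efficiently by TPMs on real models.
p_a_exact = p_joint[:, high_return] / p_joint[:, high_return].sum()

# (b) Crude approximation: sample from the unconditioned joint and keep
# only samples whose return lands in the target bin (rejection-style).
samples = rng.choice(num_actions * num_return_bins, size=200, p=p_joint.ravel())
actions, returns = np.unravel_index(samples, p_joint.shape)
kept = actions[returns == high_return]
p_a_approx = np.bincount(kept, minlength=num_actions) / max(len(kept), 1)

print("exact  p(a | R=high):", np.round(p_a_exact, 3))
print("approx p(a | R=high):", np.round(p_a_approx, 3), f"({len(kept)} of 200 kept)")
```

In this toy setting the rejection estimate is noisy because few samples hit the rare high-return bin; the abstract's claim is that analogous estimation error in real offline RL pipelines erodes the value of an otherwise accurate sequence model, which motivates exact inference via TPMs.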