Offline reinforcement learning (RL) algorithms aim to learn performant, well-generalizing policies from a static dataset of interactions. Many recent approaches to offline RL have seen substantial success, but with one key caveat: they demand extensive per-dataset hyperparameter tuning to achieve their reported performance. Evaluating each hyperparameter setting requires policy rollouts in the environment, which rapidly becomes cumbersome. Furthermore, heavy tuning requirements can hamper the adoption of these algorithms in practical domains. In this paper, we present TD3 with Behavioral Supervisor Tuning (TD3-BST), an algorithm that trains an uncertainty model and uses it to guide the policy toward actions within the dataset support. TD3-BST learns more effective policies from offline datasets than previous methods and achieves the best performance across challenging benchmarks without requiring per-dataset tuning.