Recently, it has been shown that for offline deep reinforcement learning (DRL), pre-training Decision Transformer with a large language corpus can improve downstream performance (Reid et al., 2022). A natural question to ask is whether this performance gain can only be achieved with language pre-training, or can be achieved with simpler pre-training schemes which do not involve language. In this paper, we first show that language is not essential for improved performance, and indeed pre-training with synthetic IID data for a small number of updates can match the performance gains from pre-training with a large language corpus; moreover, pre-training with data generated by a one-step Markov chain can further improve the performance. Inspired by these experimental results, we then consider pre-training Conservative Q-Learning (CQL), a popular offline DRL algorithm, which is Q-learning-based and typically employs a Multi-Layer Perceptron (MLP) backbone. Surprisingly, pre-training with simple synthetic data for a small number of updates can also improve CQL, providing consistent performance improvement on D4RL Gym locomotion datasets. The results of this paper not only illustrate the importance of pre-training for offline DRL but also show that the pre-training data can be synthetic and generated with remarkably simple mechanisms.
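The two synthetic data sources named above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names, the uniform IID distribution, and the softmax-of-Gaussian-logits transition matrix with a `temperature` parameter are all assumptions made for this example. The only structure taken from the text is that tokens are either drawn independently or drawn from a chain in which each token depends only on the previous one.

```python
import numpy as np


def sample_iid(num_tokens, vocab_size, rng):
    """Draw tokens independently and uniformly (assumed distribution)."""
    return rng.integers(0, vocab_size, size=num_tokens)


def make_transition_matrix(vocab_size, rng, temperature=1.0):
    """Build a random row-stochastic matrix (hypothetical construction).

    Each row is a softmax over Gaussian logits; `temperature` is an
    illustrative knob, not a parameter confirmed by the source.
    """
    logits = rng.normal(size=(vocab_size, vocab_size)) / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)


def sample_markov_chain(num_tokens, transition, rng):
    """Sample a sequence where each token depends only on the previous one."""
    vocab_size = transition.shape[0]
    seq = np.empty(num_tokens, dtype=np.int64)
    seq[0] = rng.integers(0, vocab_size)  # uniform initial state (assumed)
    for t in range(1, num_tokens):
        seq[t] = rng.choice(vocab_size, p=transition[seq[t - 1]])
    return seq


rng = np.random.default_rng(0)
iid_tokens = sample_iid(1024, vocab_size=100, rng=rng)
transition = make_transition_matrix(100, rng)
markov_tokens = sample_markov_chain(1024, transition, rng)
```

Sequences like `iid_tokens` or `markov_tokens` could then serve as next-token prediction targets during a short pre-training phase, before the model is fine-tuned on the offline DRL dataset.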