Reinforcement learning (RL) agents are commonly trained and evaluated in the same environment. In contrast, humans often train in a specialized environment before being evaluated, such as studying a book before taking an exam. The potential of such specialized training environments is still vastly underexplored, despite their capacity to dramatically speed up training. The framework of synthetic environments takes a first step in this direction by meta-learning neural network-based Markov decision processes (MDPs). The initial approach was limited to toy problems and produced environments that did not transfer to unseen RL algorithms. We extend this approach in three ways: Firstly, we modify the meta-learning algorithm to discover environments invariant to hyperparameter configurations and learning algorithms. Secondly, by leveraging hardware parallelism and introducing a curriculum on an agent's evaluation episode horizon, we achieve competitive results on several challenging continuous control problems. Thirdly, we find, surprisingly, that contextual bandits enable training RL agents that transfer well to their evaluation environment, even if it is a complex MDP. Hence, we set up our experiments to train synthetic contextual bandits, which perform on par with synthetic MDPs, yield additional insights into the evaluation environment, and can speed up downstream applications.
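To make the bilevel structure described above concrete, the following is a minimal, self-contained sketch (not the authors' implementation) of the idea: an outer evolution strategy meta-learns the parameters of a synthetic environment, while the inner loop trains a fresh agent inside it, with a randomly sampled learning rate as a stand-in for the hyperparameter invariance of the first extension. The synthetic environment is a one-step contextual bandit, matching the third extension. All names, dimensions, and the toy "real task" are illustrative assumptions, not details from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

CTX, ACT, N_THETA = 4, 2, 4 * 2          # context dim, action dim, |theta|

def synthetic_bandit_reward(theta, ctx, action):
    """Synthetic contextual bandit: reward is a meta-learned function of
    (context, action); every episode terminates after a single step."""
    W = theta.reshape(CTX, ACT)
    return float(ctx @ W @ action)

def train_agent_in_synthetic_env(theta, inner_rng):
    """Inner loop: crude two-point gradient hill climbing on a linear policy,
    with a randomly sampled learning rate to encourage hyperparameter
    invariance of the discovered environment."""
    lr = 10 ** inner_rng.uniform(-3, -1)          # sampled hyperparameter
    policy = inner_rng.normal(size=(CTX, ACT)) * 0.1
    for _ in range(200):
        ctx = inner_rng.normal(size=CTX)
        noise = inner_rng.normal(size=policy.shape) * 0.1
        r_plus = synthetic_bandit_reward(theta, ctx, ctx @ (policy + noise))
        r_minus = synthetic_bandit_reward(theta, ctx, ctx @ (policy - noise))
        policy += lr * (r_plus - r_minus) * noise  # antithetic estimate
    return policy

def evaluate_on_real_task(policy):
    """Stand-in for the real evaluation MDP: distance to a fixed target
    policy. In practice this would be a full control environment."""
    target = np.eye(CTX, ACT)
    return -float(np.sum((policy - target) ** 2))

# Outer loop: simple isotropic evolution strategy over environment parameters.
theta = rng.normal(size=N_THETA) * 0.1
sigma, meta_lr, pop = 0.1, 0.05, 16
for gen in range(50):
    eps = rng.normal(size=(pop, N_THETA))
    fitness = np.array([
        evaluate_on_real_task(train_agent_in_synthetic_env(theta + sigma * e, rng))
        for e in eps
    ])
    adv = (fitness - fitness.mean()) / (fitness.std() + 1e-8)
    theta += meta_lr / (pop * sigma) * (adv @ eps)   # ES gradient estimate
```

Two properties of the abstract are visible even in this toy: the population of candidate environments evaluates independently, which is what hardware parallelism exploits, and the one-step bandit episodes make each inner-loop rollout far cheaper than rollouts in a long-horizon MDP.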