The safe application of reinforcement learning (RL) requires generalization from limited training data to unseen scenarios. Yet fulfilling tasks under changing circumstances remains a key challenge in RL. Current state-of-the-art approaches to generalization apply data augmentation techniques to increase the diversity of the training data. Although this prevents overfitting to the training environment(s), it hinders policy optimization. Crafting a suitable observation that contains only crucial information has itself been shown to be a challenging task. To improve data efficiency and generalization capabilities, we propose Compact Reshaped Observation Processing (CROP), which reduces the state information used for policy optimization. Providing only relevant information precludes overfitting to a specific training layout and improves generalization to unseen environments. We formulate three CROPs that can be applied to fully observable observation and action spaces and provide a methodical foundation. We empirically show the improvements of CROP in a distributionally shifted safety gridworld. Furthermore, we provide benchmark comparisons to full observability and data augmentation in two procedurally generated mazes of different sizes.
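The idea of reducing a fully observable state to only locally relevant information can be illustrated with a minimal sketch. The abstract does not specify the three CROP formulations, so the following is not the authors' method: the function name `radius_crop`, the `radius` and `pad_value` parameters, and the integer-grid encoding are all assumptions made for illustration. It shows one plausible way to replace a full grid observation with an agent-centred window, so that the policy input no longer depends on the absolute layout.

```python
import numpy as np


def radius_crop(grid: np.ndarray, agent_pos: tuple, radius: int,
                pad_value: int = -1) -> np.ndarray:
    """Hypothetical helper: return the (2r+1) x (2r+1) window around the agent.

    Cells that fall outside the grid are filled with `pad_value`.
    """
    # Pad every side by `radius` so windows near the border stay full-sized.
    padded = np.pad(grid, radius, mode="constant", constant_values=pad_value)
    row, col = agent_pos
    size = 2 * radius + 1
    # After padding, the agent sits at (row + radius, col + radius),
    # so the window centred on it starts at (row, col) in padded coordinates.
    return padded[row:row + size, col:col + size]


# Illustrative 5x5 grid whose cells simply hold their flat index.
grid = np.arange(25).reshape(5, 5)

corner_view = radius_crop(grid, agent_pos=(0, 0), radius=1)
center_view = radius_crop(grid, agent_pos=(2, 2), radius=1)
```

With this sketch, the observation size depends only on `radius` and not on the maze size, which is one reason a cropped observation could transfer between differently sized layouts; whether this matches any of the paper's CROP variants is not stated in the abstract.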