Despite numerous successes, the field of reinforcement learning (RL) remains far from matching the impressive generalisation power of human behaviour learning. One possible way to help bridge this gap is to provide RL agents with richer, more human-like feedback expressed in natural language. To investigate this idea, we first extend BabyAI to automatically generate language feedback from the environment dynamics and goal condition success. Then, we modify the Decision Transformer architecture to take advantage of this additional signal. We find that training with language feedback, either in place of or in addition to the return-to-go or goal descriptions, improves agents' generalisation performance, and that agents can benefit from feedback even when it is available only during training and not at inference.