High sample complexity has long been a challenge for RL. Humans, by contrast, learn to perform tasks not only from interaction or demonstrations, but also by reading unstructured text documents such as instruction manuals. Instruction manuals and wiki pages are among the most abundant sources of data that could inform agents of valuable features and policies, or of task-specific environmental dynamics and reward structures. We therefore hypothesize that the ability to use human-written instruction manuals when learning policies for specific tasks should lead to more efficient and better-performing agents. We propose the Read and Reward framework, which speeds up RL algorithms on Atari games by reading manuals released by the Atari game developers. Our framework consists of a QA Extraction module that extracts and summarizes relevant information from the manual, and a Reasoning module that evaluates object-agent interactions based on that information. An auxiliary reward is then provided to a standard A2C RL agent whenever an interaction is detected. When assisted by our design, A2C improves on 4 Atari games with sparse rewards, and requires 1000x fewer training frames than the previous state of the art, Agent57, on Skiing, the hardest game in Atari.
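The reward-shaping step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `Box` type, the pixel `margin` used to decide that an interaction occurred, the `manual_labels` dictionary (a hypothetical stand-in for the output of the QA Extraction and Reasoning modules), and the +1/-1 bonus values are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in pixel coordinates (hypothetical type)."""
    x: float
    y: float
    w: float
    h: float


def boxes_interact(a: Box, b: Box, margin: float = 0.0) -> bool:
    """Return True if the boxes overlap or lie within `margin` pixels of each other."""
    return (
        a.x - margin < b.x + b.w
        and b.x - margin < a.x + a.w
        and a.y - margin < b.y + b.h
        and b.y - margin < a.y + a.h
    )


def auxiliary_reward(
    agent: Box,
    objects: Dict[str, Box],
    manual_labels: Dict[str, Optional[bool]],
    margin: float = 2.0,
    bonus: float = 1.0,
) -> float:
    """Sum a bonus or penalty for every object the agent currently interacts with.

    `manual_labels[name]` is assumed to be True if the manual suggests the
    interaction is beneficial, False if harmful, and None (or missing) if the
    manual says nothing about that object.
    """
    reward = 0.0
    for name, box in objects.items():
        if not boxes_interact(agent, box, margin):
            continue  # no interaction detected, so no auxiliary signal
        label = manual_labels.get(name)
        if label is True:
            reward += bonus
        elif label is False:
            reward -= bonus
    return reward


def shaped_reward(
    env_reward: float,
    agent: Box,
    objects: Dict[str, Box],
    manual_labels: Dict[str, Optional[bool]],
    margin: float = 2.0,
    bonus: float = 1.0,
) -> float:
    """Environment reward plus the manual-derived auxiliary reward."""
    return env_reward + auxiliary_reward(agent, objects, manual_labels, margin, bonus)
```

In this sketch, the shaped value would be passed to the RL agent in place of the raw environment reward at each step. The object positions are assumed to come from some upstream detector, which the sketch does not model.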