Human-AI policy specification is a novel procedure we define in which humans collaboratively warm-start a robot's reinforcement learning (RL) policy. The procedure comprises two steps: (1) policy specification, in which humans specify the behavior they would like their companion robot to perform, and (2) policy optimization, in which the robot applies reinforcement learning to improve the initial policy. Existing approaches to collaborative policy specification are often unintelligible black-box methods and are not designed to make the autonomous system accessible to a novice end user. In this paper, we develop a novel collaborative framework that allows humans to initialize and interpret an autonomous agent's behavior. Through our framework, humans specify an initial behavior model in unstructured natural language (NL), which we convert to lexical decision trees. We then use these translated specifications to warm-start reinforcement learning, allowing the agent to further optimize these potentially suboptimal policies. Our approach warm-starts an RL agent from non-expert natural language specifications without incurring additional domain exploration costs. We validate our approach by showing that our model achieves >80% translation accuracy and that policies initialized by a human can match the performance of relevant RL baselines in two domains.
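To make the two-step procedure concrete, the following is a minimal sketch, assuming a tabular setting rather than the architecture used in the paper. A hand-written decision tree stands in for the output of the natural language translation step, and the tree's choices seed a Q-table that a standard RL update could then refine. All names here (`Node`, `Leaf`, `tree_action`, `warm_start_q`, the `obstacle_distance` feature, and the `bonus` value) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Union

State = Dict[str, float]


@dataclass
class Leaf:
    action: str  # action chosen when this leaf is reached


@dataclass
class Node:
    feature: str      # name of the state feature to test
    threshold: float  # the test is state[feature] > threshold
    if_true: "Tree"   # subtree followed when the test holds
    if_false: "Tree"  # subtree followed otherwise


Tree = Union[Leaf, Node]


def tree_action(tree: Tree, state: State) -> str:
    """Walk the tree from the root to a leaf and return its action."""
    while isinstance(tree, Node):
        tree = tree.if_true if state[tree.feature] > tree.threshold else tree.if_false
    return tree.action


def warm_start_q(
    tree: Tree,
    states: Dict[str, State],
    actions: List[str],
    bonus: float = 1.0,  # assumed initial preference for the specified action
) -> Dict[str, Dict[str, float]]:
    """Build a Q-table whose greedy policy matches the tree's policy."""
    q: Dict[str, Dict[str, float]] = {}
    for key, state in states.items():
        preferred = tree_action(tree, state)
        q[key] = {a: (bonus if a == preferred else 0.0) for a in actions}
    return q


def greedy(q: Dict[str, Dict[str, float]], key: str) -> str:
    """Return the highest-valued action for a state key."""
    return max(q[key], key=q[key].get)


# Hypothetical specification: "If the obstacle is close, turn left;
# otherwise go forward."
tree = Node("obstacle_distance", 1.0, Leaf("forward"), Leaf("turn_left"))
states = {"near": {"obstacle_distance": 0.5}, "far": {"obstacle_distance": 3.0}}
q = warm_start_q(tree, states, ["forward", "turn_left"])
```

Because the table starts out agreeing with the human's specification, the agent acts on that specification from the first episode, and subsequent RL updates are free to overwrite any entries where the specification turns out to be suboptimal.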