Successful teaching requires an assumption about how the learner learns, that is, how the learner uses experiences from the world to update its internal states. We investigate what expectations people have about a learner when they teach it online using rewards and punishments. We focus on a common reinforcement learning method, Q-learning, and use a behavioral experiment to examine what people assume about it. To do so, we first establish a normative standard by formulating the problem as a machine teaching optimization problem. To solve this optimization problem, we use a deep learning approximation method that simulates learners in the environment and learns to predict how feedback affects the learner's internal states. What do people assume about a learner's learning and discount rates when they teach it an idealized exploration-exploitation task? In a behavioral experiment, we find that people can teach the task to Q-learners relatively efficiently and effectively when the learner uses a small discount rate and a large learning rate. However, their teaching is still suboptimal. We also find that providing people with real-time updates of how possible feedback would affect the Q-learner's internal states helps them teach, but only weakly. Our results reveal how people teach using evaluative feedback and provide guidance for how engineers should design machine agents to be intuitive for people.
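
To make the roles of the learning rate and the discount rate concrete, the following is a minimal sketch of the standard tabular Q-learning update, assuming the teacher's evaluative feedback serves as the learner's reward signal. The function name `q_update`, the state and action labels, and the parameter values are illustrative assumptions, not details taken from the study.

```python
from collections import defaultdict


def q_update(q, state, action, feedback, next_state, actions, alpha, gamma):
    """Apply one tabular Q-learning update and return the new Q-value.

    feedback: scalar reward or punishment supplied by the teacher (assumed
              here to be used directly as the reward).
    alpha:    learning rate, how strongly one piece of feedback moves the estimate.
    gamma:    discount rate, how much estimated future value counts.
    """
    # Best estimated value available from the next state.
    best_next = max(q[(next_state, a)] for a in actions)
    # Temporal-difference target: immediate feedback plus discounted future value.
    td_target = feedback + gamma * best_next
    # Move the current estimate a fraction alpha of the way toward the target.
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]


# Hypothetical two-action task with a large learning rate and a small discount rate.
actions = ["left", "right"]
q = defaultdict(float)  # all Q-values start at 0.0
new_value = q_update(q, "s0", "right", 1.0, "s1", actions, alpha=0.9, gamma=0.1)
print(new_value)
```

With a large `alpha` and a small `gamma`, a single reward moves the Q-value of the rewarded action most of the way to the feedback value, and estimated future value contributes little to the target.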