A key challenge in e-learning environments such as Intelligent Tutoring Systems (ITSs) is to induce effective pedagogical policies efficiently. While Deep Reinforcement Learning (DRL) often suffers from sample inefficiency and the difficulty of designing reward functions, Apprenticeship Learning (AL) algorithms can overcome both issues. However, most AL algorithms cannot handle heterogeneity, as they assume all demonstrations are generated by a homogeneous policy driven by a single reward function. Moreover, the few AL algorithms that do consider heterogeneity often cannot generalize to large continuous state spaces and work only with discrete states. In this paper, we propose EM-EDM, a general expectation-maximization (EM)-based AL framework that induces effective pedagogical policies from given optimal or near-optimal demonstrations, which are assumed to be driven by heterogeneous reward functions. We compare the policies induced by EM-EDM against four AL-based baselines and two DRL-induced policies on two different but related pedagogical action-prediction tasks. Our results show that, on both tasks, EM-EDM outperforms all four AL baselines across all performance metrics, as well as the two DRL baselines. This suggests that EM-EDM can effectively model complex student pedagogical decision-making processes: it handles a large, continuous state space and adapts to diverse, heterogeneous reward functions from very few demonstrations.
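To make the EM idea behind the framework concrete, the sketch below clusters demonstrations that arise from heterogeneous latent "reward styles": the E-step soft-assigns each demonstration to a cluster, and the M-step refits each cluster's parameters from its assigned demonstrations. This is a minimal illustrative toy, not the paper's method: demonstrations are reduced to 2-D feature vectors, and an isotropic Gaussian likelihood stands in for the per-cluster policy likelihood that an AL inner loop such as EDM would supply.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for heterogeneous demonstrations: each "trajectory" is
# summarized by a 2-D feature-expectation vector, and two latent reward
# styles generate them with different weights. (Illustrative assumption;
# the actual framework works on full student trajectories.)
demos = np.vstack([
    rng.normal([1.0, -1.0], 0.3, size=(20, 2)),   # style A
    rng.normal([-1.0, 1.0], 0.3, size=(20, 2)),   # style B
])

K = 2
means = demos[[0, 20]].copy()      # init each cluster on one demonstration
priors = np.full(K, 1.0 / K)

for _ in range(30):
    # E-step: responsibility of each cluster for each demonstration.
    # The Gaussian log-likelihood here is a placeholder for the
    # policy likelihood an AL algorithm would compute per cluster.
    d2 = ((demos[:, None, :] - means[None, :, :]) ** 2).sum(axis=-1)
    log_r = np.log(priors)[None, :] - 0.5 * d2
    log_r -= log_r.max(axis=1, keepdims=True)     # numerical stability
    resp = np.exp(log_r)
    resp /= resp.sum(axis=1, keepdims=True)

    # M-step: refit each cluster's reward-weight estimate and its prior
    # from the soft-assigned demonstrations.
    nk = resp.sum(axis=0)
    means = (resp.T @ demos) / nk[:, None]
    priors = nk / len(demos)

labels = resp.argmax(axis=1)       # hard cluster label per demonstration
```

In the full framework, each recovered cluster would then drive its own apprenticeship-learning run, yielding one induced policy per latent reward function rather than a single policy averaged over heterogeneous demonstrators.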