Can Language Models Teach Weaker Agents? Teacher Explanations Improve Students via Theory of Mind

Large Language Models (LLMs) perform complex reasoning by generating explanations for their predictions. However, a complementary goal of explanations is to communicate useful knowledge that improves weaker agents. Hence, we investigate whether LLMs also make good teachers for weaker agents. In particular, we consider a student-teacher framework between two LLM agents and study if, when, and how the teacher should intervene with natural language explanations to improve the student's performance. Since communication is expensive, we define a budget such that the teacher only communicates explanations for a fraction of the data, after which the student should perform well on its own. We decompose the teaching problem along four axes: (1) whether the teacher's test-time intervention improves student predictions, (2) when it is worth explaining a data point, (3) how the teacher should personalize explanations to better teach the student, and (4) whether teacher explanations also improve student performance on future unexplained data. We first show that teacher LLMs can indeed intervene on student reasoning to improve student performance. Next, we propose a Theory of Mind approach, in which the teacher builds two few-shot mental models of the student. The first model defines an Intervention Function that simulates the utility of an intervention, which lets the teacher intervene when this utility is highest and improves student performance at lower budgets. The second model enables the teacher to personalize explanations for a particular student and outperform unpersonalized teachers. We also demonstrate that in multi-turn interactions, teacher explanations generalize: learning from explained data improves student performance on future unexplained data. Finally, we verify that misaligned teachers can lower student performance to random chance by intentionally misleading the student.
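To make the budgeted intervention idea concrete, the following is a minimal sketch of how a teacher might choose which data points to explain. It is an illustration under stated assumptions, not the implementation used in this work: the names `SimulatedSample`, `expected_utility`, and `select_interventions` are hypothetical, the utility of an intervention is assumed to be the simulated gain in the student's confidence in the correct answer, and the confidence values are taken as already produced by the teacher's mental model of the student.

```python
import math
from dataclasses import dataclass


@dataclass
class SimulatedSample:
    """Teacher's mental-model estimates for one data point (hypothetical fields)."""
    sample_id: str
    p_correct_without: float  # simulated student confidence in the correct answer, no explanation
    p_correct_with: float     # simulated student confidence when shown the teacher's explanation


def expected_utility(sample: SimulatedSample) -> float:
    # Assumed definition: utility is the simulated gain in student confidence.
    return sample.p_correct_with - sample.p_correct_without


def select_interventions(samples: list[SimulatedSample], budget: float) -> set[str]:
    """Return the ids of the samples to explain, given a budget in [0, 1]."""
    if not 0.0 <= budget <= 1.0:
        raise ValueError("budget must be a fraction between 0 and 1")
    # The budget is the fraction of the data the teacher may explain.
    k = math.floor(budget * len(samples))
    # Rank by simulated utility and intervene where it is highest.
    ranked = sorted(samples, key=expected_utility, reverse=True)
    return {s.sample_id for s in ranked[:k]}
```

With a budget of 0.5 over four samples, this sketch explains the two data points with the largest simulated gain and leaves the student to answer the rest unaided. A sample whose simulated utility is negative (the explanation is predicted to hurt) is ranked last and is explained only if the budget is large enough to reach it.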
