Can Language Models Teach Weaker Agents? Teacher Explanations Improve Students via Personalization

A hallmark property of explainable AI models is the ability to teach other agents, communicating knowledge of how to perform a task. While Large Language Models (LLMs) perform complex reasoning by generating explanations for their predictions, it is unclear whether they also make good teachers for weaker agents. To address this, we consider a student-teacher framework between two LLM agents and study if, when, and how the teacher should intervene with natural language explanations to improve the student's performance. Since communication is expensive, we define a budget such that the teacher only communicates explanations for a fraction of the data, after which the student should perform well on its own. We decompose the teaching problem along four axes: (1) whether the teacher's test-time intervention improves student predictions, (2) when it is worth explaining a data point, (3) how the teacher should personalize explanations to better teach the student, and (4) whether teacher explanations also improve the student on future unexplained data. We first show that teacher LLMs can indeed intervene on student reasoning to improve student performance. Next, inspired by the Theory of Mind abilities of effective teachers, we propose building two few-shot mental models of the student. The first model defines an Intervention Function that simulates the utility of an intervention, allowing the teacher to intervene when this utility is the highest and improving student performance at lower budgets. The second model enables the teacher to personalize explanations for a particular student and outperform unpersonalized teachers. We also demonstrate that in multi-turn interactions, teacher explanations generalize: learning from explained data improves student performance on future unexplained data. Finally, we verify that misaligned teachers can lower student performance to random chance by intentionally misleading the student.
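The budgeted intervention idea can be illustrated with a minimal sketch. It assumes the teacher has already obtained, for each data point, a simulated probability that the student answers correctly with and without an explanation. The abstract does not specify how these quantities are computed, so the function names `expected_utility` and `select_interventions`, the utility defined as a simple difference, and the top-k selection rule are all hypothetical illustrations, not the paper's implementation.

```python
from typing import List, Sequence


def expected_utility(p_with: float, p_without: float) -> float:
    """Assumed utility: simulated gain in the student's probability of being
    correct when the teacher explains, versus when the student is left alone."""
    return p_with - p_without


def select_interventions(utilities: Sequence[float], budget: float) -> List[int]:
    """Return the indices of the data points the teacher explains.

    `budget` is the fraction of data points the teacher may explain.
    Points are ranked by utility and the top fraction is chosen.
    """
    if not 0.0 <= budget <= 1.0:
        raise ValueError("budget must be a fraction in [0, 1]")
    # Number of explanations the teacher can afford (small epsilon guards
    # against floating-point round-off such as 0.3 * 10 = 2.999...).
    k = int(budget * len(utilities) + 1e-9)
    ranked = sorted(range(len(utilities)), key=lambda i: utilities[i], reverse=True)
    return sorted(ranked[:k])


# Hypothetical simulated probabilities for five data points.
p_with = [0.9, 0.8, 0.5, 0.9, 0.6]
p_without = [0.8, 0.3, 0.7, 0.5, 0.6]
utilities = [expected_utility(w, wo) for w, wo in zip(p_with, p_without)]
chosen = select_interventions(utilities, budget=0.4)
print(chosen)  # [1, 3]
```

With a budget of 0.4 over five points, the teacher can explain two of them, and under this assumed ranking rule it picks the two points where the simulated gain is largest. A point with negative utility (where the explanation is predicted to hurt) is only selected if the budget is large enough to reach it; a stricter variant could skip such points entirely.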
