Large Language Models (LLMs) have recently shown remarkable reasoning abilities. However, larger models often outperform their smaller counterparts on reasoning tasks, which raises the challenge of effectively transferring these capabilities from larger models to smaller ones. Existing approaches rely heavily on extensive fine-tuning data or on continuous interaction with a superior teacher LLM during inference. To address these limitations, we introduce a principle-based teacher-student framework called ``Teaching via Principle Discovery'' (TPD). Inspired by human learning mechanisms, TPD mimics the interaction between a teacher and a student using a principle-based approach. The teacher LLM generates problem-solving instructions and, based on the student LLM's errors, corrective principles. These principles guide the refinement of the instructions and the selection of instructive examples from a validation set, so the student model learns from both the teacher's guidance and its own mistakes. Once the student model begins inference, TPD requires no further intervention from the teacher LLM or humans. Through extensive experiments across eight reasoning tasks, we demonstrate the effectiveness of TPD. Compared to standard chain-of-thought prompting, TPD significantly improves the student model's performance, with an average improvement of $6.2\%$.
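The teacher-student flow summarized in the abstract can be pictured with a minimal sketch. This is an illustration under stated assumptions, not the authors' implementation: the function `build_student_prompt`, its callable parameters, the exact-match error check, and the rule of picking the student's failed validation items as instructive examples are all hypothetical stand-ins for steps the abstract describes only at a high level.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, str]            # (question, gold answer)
ErrorCase = Tuple[str, str, str]     # (question, gold answer, student answer)


def build_student_prompt(
    teacher_instruct: Callable[[], str],
    teacher_principles: Callable[[List[ErrorCase]], List[str]],
    student_answer: Callable[[str, str], str],
    validation_set: List[Example],
    num_examples: int = 2,
) -> str:
    """Assemble a fixed prompt for the student (hypothetical helper).

    All callables are placeholders for LLM calls; none is a real API.
    """
    # Step 1: the teacher writes a problem-solving instruction.
    instruction = teacher_instruct()

    # Step 2: the student attempts the validation set; collect its errors.
    # Assumption: an error is any answer that differs from the gold answer.
    errors: List[ErrorCase] = []
    for question, gold in validation_set:
        predicted = student_answer(instruction, question)
        if predicted != gold:
            errors.append((question, gold, predicted))

    # Step 3: the teacher turns the errors into corrective principles.
    principles = teacher_principles(errors) if errors else []

    # Step 4: refine the instruction with the principles and attach
    # instructive examples. Assumption: examples are the failed items.
    examples = [(q, gold) for q, gold, _ in errors][:num_examples]

    parts = [instruction]
    parts += [f"Principle: {p}" for p in principles]
    parts += [f"Q: {q}\nA: {a}" for q, a in examples]
    return "\n".join(parts)
```

The returned prompt is built once and then reused, which mirrors the abstract's claim that no teacher or human is needed after the student begins inference. A real system would replace the three callables with actual model calls and would likely use a more careful answer comparison and example-selection rule.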