The Socratic method is a way of guiding students toward solving a problem independently without directly revealing the solution. Although this method has been shown to significantly improve student learning outcomes, it remains a complex, labor-intensive task for instructors. Large language models (LLMs) can augment human effort by automatically generating Socratic questions for students. However, existing methods that rely on prompting these LLMs sometimes produce invalid outputs, e.g., questions that directly reveal the solution to the problem, or that are irrelevant or premature. To alleviate this problem, inspired by reinforcement learning with AI feedback (RLAIF), we first propose a data augmentation method that enriches existing Socratic questioning datasets with questions that are invalid in specific ways. Next, we propose a method to optimize open-source LLMs such as Llama 2 to prefer ground-truth questions over generated invalid ones, using direct preference optimization (DPO). Our experiments on a Socratic questioning dataset for student code debugging show that a DPO-optimized 7B Llama 2 model can effectively avoid generating invalid questions and, as a result, outperforms existing state-of-the-art prompting methods.
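To make the preference-optimization step concrete, the following is a minimal sketch of how ground-truth and invalid questions could be paired and scored with the standard DPO objective. It is an illustration, not the authors' implementation: the names `PreferencePair`, `build_pairs`, and `dpo_loss` are hypothetical, the value `beta=0.1` is an assumed default, and the log-probabilities are placeholder numbers rather than outputs of a real model.

```python
import math
from dataclasses import dataclass


@dataclass
class PreferencePair:
    """One DPO training example (hypothetical container)."""
    prompt: str    # problem, buggy code, and dialogue so far
    chosen: str    # ground-truth Socratic question
    rejected: str  # generated invalid question


def build_pairs(prompt, ground_truth, invalid_questions):
    """Pair the ground-truth question with each invalid question."""
    return [PreferencePair(prompt, ground_truth, bad) for bad in invalid_questions]


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one pair: -log(sigmoid(margin)).

    Each argument is the summed token log-probability of a response
    under the trainable policy or the frozen reference model.
    beta=0.1 is an assumed value, not one taken from the abstract.
    """
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # Numerically stable form of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Illustrative data; the questions and log-probabilities are made up.
pairs = build_pairs(
    prompt="The loop skips the last list element. Student: why is my sum wrong?",
    ground_truth="What values does your loop index take on each iteration?",
    invalid_questions=[
        "Did you notice that range(len(xs) - 1) should be range(len(xs))?",
        "Have you considered rewriting this in another language?",
    ],
)
untrained = dpo_loss(-3.0, -3.0, -3.0, -3.0)  # policy equals reference
improved = dpo_loss(-1.0, -5.0, -3.0, -3.0)   # policy now favors the chosen question
```

In this sketch the loss falls as the policy assigns relatively more probability to the ground-truth question than the reference model does, which is the behavior the optimization is meant to encourage.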