Designing good reflection questions is pedagogically important but time-consuming and unevenly supported across teachers. This paper introduces a reflection-in-reflection framework for the automated generation of reflection questions with large language models (LLMs). Our approach coordinates two role-specialized agents, a Student-Teacher and a Teacher-Educator, that engage in a Socratic multi-turn dialogue to iteratively refine a single question given a teacher-specified topic, key concepts, student level, and optional instructional materials. The Student-Teacher proposes candidate questions with brief rationales, while the Teacher-Educator evaluates them along clarity, depth, relevance, engagement, and conceptual interconnections, responding only with targeted coaching questions or a fixed signal to stop the dialogue. We evaluate the framework in an authentic lower-secondary ICT setting on a teacher-specified topic, using GPT-4o-mini as the backbone model and a stronger GPT-4-class LLM as an external evaluator in pairwise comparisons of clarity, relevance, depth, and overall quality. First, we study how interaction design and context (dynamic vs. fixed iteration counts; presence or absence of student level and materials) affect question quality. Dynamic stopping combined with contextual information consistently outperforms fixed 5- or 10-step refinement, with very long dialogues prone to drift or over-complication. Second, we show that our two-agent protocol produces questions that are judged substantially more relevant, deeper, and better overall than a one-shot baseline using the same backbone model.
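The two-agent protocol described above can be sketched as a simple refinement loop: the Student-Teacher proposes a candidate question, the Teacher-Educator responds with either a coaching question or a fixed stop signal, and the dialogue ends on the stop signal or a hard iteration cap. The sketch below is illustrative only, under stated assumptions: `call_llm` is a hypothetical placeholder for a chat-completion call (e.g., to GPT-4o-mini), stubbed here so the example runs standalone, and `STOP_SIGNAL` and `MAX_TURNS` are assumed names for the paper's fixed stop signal and dynamic-stopping cap.

```python
# Hypothetical sketch of the two-agent Socratic refinement loop.
# `call_llm` stands in for a real chat-completion API call; the stub below
# mimics a Teacher-Educator that coaches twice and then signals a stop.

STOP_SIGNAL = "[STOP]"  # assumed fixed stop signal, for illustration
MAX_TURNS = 10          # hard cap; the dynamic stop usually fires earlier


def call_llm(role_prompt: str, history: list) -> str:
    # Stub: a real implementation would send `role_prompt` and `history`
    # to the backbone model and return its reply.
    if role_prompt.startswith("Teacher-Educator"):
        if len(history) >= 4:
            return STOP_SIGNAL
        return "How could the question better connect the key concepts?"
    return "Candidate reflection question (revised given feedback so far)."


def refine_question(topic: str, concepts: list, level: str, materials=None):
    """Iteratively refine one reflection question via the two-agent dialogue."""
    history = []
    question = None
    for _ in range(MAX_TURNS):
        # Student-Teacher proposes or revises the candidate question.
        question = call_llm(
            f"Student-Teacher: propose a reflection question on {topic} "
            f"covering {concepts} for {level} students.",
            history,
        )
        history.append(("student-teacher", question))
        # Teacher-Educator evaluates along clarity, depth, relevance,
        # engagement, and conceptual interconnections.
        feedback = call_llm(
            "Teacher-Educator: evaluate the latest candidate and respond "
            "with a coaching question or the stop signal.",
            history,
        )
        if feedback == STOP_SIGNAL:  # dynamic stopping
            break
        history.append(("teacher-educator", feedback))
    return question, len(history)


question, turns = refine_question(
    "spreadsheets", ["formulas", "cell references"], "lower-secondary"
)
```

With the stub above, the dialogue stops dynamically after a few turns rather than running the full `MAX_TURNS`, mirroring the finding that dynamic stopping avoids the drift seen in long fixed-length refinements.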