We evaluate the effectiveness of GPT-4 Turbo at generating educational questions from NCERT textbooks in a zero-shot setting. Our study highlights GPT-4 Turbo's ability to generate questions that require higher-order thinking skills, especially at the "understanding" level of Bloom's Revised Taxonomy. The complexity level of the questions GPT-4 Turbo generates is largely consistent with the level assigned by human evaluators, although occasional differences occur. Our evaluation also uncovers differences in how humans and machines judge question quality, with a trend inversely related to Bloom's Revised Taxonomy level. These findings suggest that GPT-4 Turbo is a promising tool for educational question generation, but that its efficacy varies across cognitive levels and that further refinement is needed to fully meet educational standards.