Traditional synchronous STEM assessments face growing challenges, including accessibility barriers, security concerns from resource-sharing platforms, and limited comparability across institutions. We present a framework for generating and evaluating large-scale isomorphic physics problem banks using generative AI to enable asynchronous, multi-attempt assessments. Isomorphic problems test identical concepts through varied surface features and contexts, providing richer variation than conventional parameterized questions while maintaining consistent difficulty. Our generation framework employs prompt chaining and tool use to achieve precise control over structural variations (numeric values, spatial relations) alongside diverse contextual variations. For pre-deployment validation, we evaluate generated items with 17 open-source language models (LMs, 0.6B–32B parameters) and compare their performance against actual student performance (N > 200) across three midterm exams. Results show that 73% of deployed banks achieve statistically homogeneous difficulty, and that LM performance patterns correlate strongly with student performance (Pearson's $\rho$ up to 0.594). LMs also successfully identify problematic variants, such as ambiguously worded problem texts. Model scale proves critical for effective validation: very small (<4B) and large (>14B) models exhibit floor and ceiling effects, respectively, making mid-sized models optimal for detecting difficulty outliers.
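As a minimal sketch of what checking "statistically homogeneous difficulty" across a bank's variants could look like, the snippet below runs a chi-square test of homogeneity over per-variant correct/incorrect counts. The counts, the choice of test, and the `chi2_homogeneity` helper are illustrative assumptions, not the paper's reported procedure.

```python
# Sketch: test whether isomorphic variants of one problem bank show
# homogeneous difficulty, using a chi-square test of homogeneity.
# All numbers below are hypothetical, for illustration only.

# Each row is one variant: (correct attempts, incorrect attempts).
counts = [(48, 12), (45, 15), (50, 10), (44, 16)]

def chi2_homogeneity(table):
    """Chi-square statistic for a variants x {correct, incorrect} table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand
            stat += (observed - expected) ** 2 / expected
    return stat

stat = chi2_homogeneity(counts)
dof = (len(counts) - 1) * (2 - 1)  # (rows - 1) * (cols - 1)
CRITICAL_05 = 7.815                # chi-square critical value, dof=3, alpha=0.05
print(f"chi2={stat:.3f}, dof={dof}, reject homogeneity: {stat > CRITICAL_05}")
```

Under these made-up counts the statistic stays below the critical value, i.e. the variants would be treated as having homogeneous difficulty; a bank failing the test would flag a difficulty outlier among its variants.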