As Large Language Models (LLMs) are increasingly deployed to generate educational content, a critical safety question arises: can these models reliably estimate the difficulty of the questions they produce? Using Brazil's high-stakes ENEM exam as a testbed, we benchmark ten proprietary and open-weight LLMs against official Item Response Theory (IRT) parameters for 1,031 questions. We evaluate performance along three axes: absolute calibration, rank fidelity, and context sensitivity across learner backgrounds. Our results reveal a clear trade-off: while the best models achieve moderate rank correlation, they systematically underestimate difficulty and degrade markedly on multimodal items. Crucially, we find that models exhibit limited and inconsistent plasticity when prompted with student demographic cues, suggesting they are not yet ready for context-adaptive personalization. We conclude that LLMs function best as calibrated screeners rather than authoritative oracles, supporting an "evaluation-before-generation" pipeline for responsible assessment design.