Automatic generation of educational materials using large language models (LLMs) is becoming increasingly common, but assigning difficulty levels to such materials still requires substantial human effort. LLM-as-a-Judge has therefore attracted attention, yet disagreement with human raters remains a major challenge. We propose a method for predicting which LLM-generated difficulty ratings are likely to disagree with human raters, so that such cases can be sent for re-rating. Unlike prior approaches, our method does not rely on generation-time probability signals, which must be collected while the ratings are generated and are often difficult to compare across LLMs. Instead, exploiting the fact that difficulty is an ordinal scale, we use a separate embedding space, such as that of ModernBERT, and identify disagreement candidates based on the geometric consistency of the rating set. Experiments on English CEFR-based sentence difficulty assessment with GPT-OSS-120B and Qwen3-235B-A22B showed that the proposed method achieved higher AUC for predicting disagreement with human raters than probability-based baselines.