Efficient Detection of Bad Benchmark Items with Novel Scalability Coefficients

The validity of assessments, from large-scale AI benchmarks to human classrooms, depends on the quality of individual items, yet modern evaluation instruments often contain thousands of items with minimal psychometric vetting. We introduce a new family of nonparametric scalability coefficients based on interitem isotonic regression for efficiently detecting globally bad items (e.g., miskeyed, ambiguously worded, or construct-misaligned). The central contribution is the signed isotonic $R^2$, which measures the maximal proportion of variance in one item explainable by a monotone function of another while preserving the direction of association via Kendall's $\tau$. Aggregating these pairwise coefficients yields item-level scores that sharply separate problematic items from acceptable ones without assuming linearity or committing to a parametric item response model. We show that the signed isotonic $R^2$ is extremal among monotone predictors (it extracts the strongest possible monotone signal between any two items) and that this optimality property translates directly into practical screening power. Across three AI benchmark datasets (HS Math, GSM8K, MMLU) and two human assessment datasets, the signed isotonic $R^2$ consistently achieves top-tier AUC for ranking bad items above good ones, matching or outperforming a comprehensive battery of classical test theory, item response theory, and dimensionality-based diagnostics. Crucially, the method remains robust under the small-n/large-p conditions typical of AI evaluation (few respondents or models, many items), requires only bivariate monotone fits computable in seconds, and handles mixed item types (binary, ordinal, continuous) without modification. It is a lightweight, model-agnostic filter that can materially reduce the reviewer effort needed to find flawed items in modern large-scale evaluation regimes.
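
To make the pairwise coefficient concrete, the following is a minimal pure-Python sketch, not the authors' implementation. It rests on several assumptions that the abstract does not spell out: the isotonic fit is taken in whichever direction (increasing or decreasing) explains more variance, ties in the predictor item are pooled before fitting, the sign is the sign of Kendall's $\tau$, and the item-level score is the plain mean of an item's coefficients against all other items. The function names (`pava`, `isotonic_r2`, `signed_isotonic_r2`, `item_scores`) and the toy response matrix `demo` are hypothetical. In practice one would likely use an existing routine such as scikit-learn's `IsotonicRegression` and SciPy's `kendalltau` instead of the hand-rolled versions here.

```python
from statistics import mean

def pava(values, weights):
    """Weighted pool-adjacent-violators: best nondecreasing fit to `values`."""
    blocks = []  # each block is [block mean, total weight, number of points]
    for v, w in zip(values, weights):
        blocks.append([v, w, 1])
        # Merge backwards while the last two blocks violate monotonicity.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            wt = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / wt, wt, c1 + c2])
    fit = []
    for m, _, c in blocks:
        fit.extend([m] * c)
    return fit

def isotonic_r2(x, y, increasing=True):
    """R^2 of the best monotone fit of y on x in the given direction."""
    groups = {}
    for xi, yi in zip(x, y):  # pool tied x values (assumed tie handling)
        groups.setdefault(xi, []).append(yi)
    keys = sorted(groups, reverse=not increasing)
    means = [mean(groups[k]) for k in keys]
    counts = [len(groups[k]) for k in keys]
    fit = dict(zip(keys, pava(means, counts)))
    y_bar = mean(y)
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    if ss_tot == 0:  # constant item: nothing to explain
        return 0.0
    ss_res = sum((yi - fit[xi]) ** 2 for xi, yi in zip(x, y))
    return 1.0 - ss_res / ss_tot

def kendall_sign(x, y):
    """Sign of Kendall's tau: concordant minus discordant pairs, O(n^2)."""
    s = 0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            p = (x[i] - x[j]) * (y[i] - y[j])
            s += (p > 0) - (p < 0)
    return (s > 0) - (s < 0)

def signed_isotonic_r2(x, y):
    """Assumed form: sign(tau) times the larger of the two directional R^2."""
    r2 = max(isotonic_r2(x, y, True), isotonic_r2(x, y, False))
    return kendall_sign(x, y) * r2

def item_scores(responses):
    """Mean signed coefficient of each item against all others (assumed)."""
    cols = list(zip(*responses))  # one tuple of responses per item
    p = len(cols)
    return [
        mean(signed_isotonic_r2(cols[k], cols[j]) for k in range(p) if k != j)
        for j in range(p)
    ]

# Toy data: 6 respondents x 4 binary items; item 3 is item 1 with a flipped key.
demo = [
    [0, 0, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [1, 1, 1, 0],
]
scores = item_scores(demo)
worst_item = scores.index(min(scores))
```

On this toy matrix the flipped item receives the lowest score, which is the ranking behavior the abstract describes. With so few items, the good item it mirrors is also pulled down, so the example should be read as an illustration of the mechanics only.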
