The validity of assessments, from large-scale AI benchmarks to human classrooms, depends on the quality of individual items, yet modern evaluation instruments often contain thousands of items with minimal psychometric vetting. We introduce a new family of nonparametric scalability coefficients based on interitem isotonic regression for efficiently detecting globally bad items (e.g., miskeyed, ambiguously worded, or construct-misaligned). The central contribution is the signed isotonic $R^2$, which measures the maximal proportion of variance in one item explainable by a monotone function of another while preserving the direction of association via Kendall's $\tau$. Aggregating these pairwise coefficients yields item-level scores that sharply separate problematic items from acceptable ones without assuming linearity or committing to a parametric item response model. We prove that the signed isotonic $R^2$ is extremal among monotone predictors (it extracts the strongest possible monotone signal between any two items) and demonstrate that this optimality property translates directly into practical screening power. Across three AI benchmark datasets (HS Math, GSM8K, MMLU) and two human assessment datasets, the signed isotonic $R^2$ consistently achieves top-tier AUC for ranking bad items above good ones, outperforming or matching a comprehensive battery of classical test theory, item response theory, and dimensionality-based diagnostics. Crucially, the method remains robust under the small-n/large-p conditions typical of AI evaluation, requires only bivariate monotone fits computable in seconds, and handles mixed item types (binary, ordinal, continuous) without modification. It is a lightweight, model-agnostic filter that can materially reduce the reviewer effort needed to find flawed items in modern large-scale evaluation regimes.
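The pairwise coefficient described above can be sketched in a few lines. The snippet below is an illustrative implementation, not the authors' reference code: it fits an isotonic regression of one item's scores on another's (with the monotone direction chosen by the sign of Kendall's $\tau$), computes the resulting $R^2$, and attaches the sign of $\tau$. The function name `signed_isotonic_r2` is ours, introduced only for this sketch.

```python
import numpy as np
from scipy.stats import kendalltau
from sklearn.isotonic import IsotonicRegression

def signed_isotonic_r2(x, y):
    """Illustrative sketch of a signed isotonic R^2 between two items.

    Returns sign(tau) times the proportion of variance in y explained
    by the best monotone (isotonic) function of x.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    tau, _ = kendalltau(x, y)
    # Fit in the direction of association indicated by Kendall's tau.
    iso = IsotonicRegression(increasing=bool(tau >= 0))
    y_hat = iso.fit_transform(x, y)
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else 0.0
    return np.sign(tau) * r2
```

A perfectly monotone increasing pair yields a coefficient of $+1$, a perfectly decreasing pair $-1$, and unrelated items a value near $0$; aggregating such coefficients over all item pairs gives the item-level screening scores discussed in the abstract.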