Recent work motivates moving large language model (LLM) evaluation from mean-based to tail-aware metrics, including conditional value-at-risk and tail-index estimates of reward-model error. We ask whether the canonical extreme-value-theory tail-index parameter, which isolates how heavy a tail is from how large the tail mass is, adds discriminative information beyond the mean and a standard tail-magnitude statistic in LLM evaluation. We pre-register a protocol covering admissibility, goodness-of-fit, threshold-stability, and effect-size requirements for any positive tail-shape claim. The protocol is the contribution of this paper; the empirical study below is a demonstration of what its gates catch. Applied to a standard LLM toxicity-evaluation setup under two structurally different scorer families, the protocol catches three distinct modes of false positives that a naive analysis would have published, and rejects the headline tail-shape claim on both scorers. We conclude that tail-shape estimation in the LLM toxicity-evaluation setups we examined is more fragile than the recent literature suggests, and recommend the protocol as a starting point for tail-index claims in similar setups.