Benchmarking the relative capabilities of AI systems, including Large Language Models (LLMs) and Vision Models, typically ignores the impact of uncertainty in the ground truth answers provided by experts. This ambiguity is not limited to subjective human preferences; it is consequential even in safety-critical domains such as medicine, where uncertainty is pervasive. In this paper, we introduce a probabilistic paradigm to explain theoretically why high certainty in ground truth answers is almost always necessary for even an expert to achieve high scores, whereas on datasets with high variation in ground truth answers there may be little difference between a random labeller and an expert. Ignoring uncertainty in ground truth evaluation data can therefore lead to the misleading conclusion that a non-expert performs comparably to an expert. Using this probabilistic paradigm, we introduce the concepts of expected accuracy and expected F1 to estimate the score an expert human or system can achieve given the variability of the ground truth answers. Our work leads to the recommendation that, when establishing the capability of a system, results should be stratified by the probability of the ground truth answer, typically measured by the agreement rate among ground truth experts. Stratification becomes critical when overall performance drops below a threshold of 80\%. Under stratified evaluation, performance comparisons become more reliable in high-certainty bins, mitigating the effect of the key confounding factor: uncertainty.
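To make the expected-accuracy notion concrete, here is a minimal sketch; the notation ($N$, $K$, $p_i$, $\hat{y}_i$) is ours and is not fixed by the abstract. Suppose item $i$ of $N$ has a ground truth distribution $p_i$ over $K$ candidate labels, estimated from expert agreement rates, and a system predicts label $\hat{y}_i$. The system is credited with the probability mass the experts place on its prediction:
\[
\mathbb{E}[\mathrm{Acc}] = \frac{1}{N}\sum_{i=1}^{N} p_i(\hat{y}_i).
\]
Under this sketch, an oracle that always selects the modal label attains the ceiling $\frac{1}{N}\sum_{i}\max_{k} p_i(k)$, while a uniformly random labeller attains $1/K$ in expectation. As per-item certainty $\max_k p_i(k)$ falls toward $1/K$, the oracle's ceiling collapses onto the random baseline, which is exactly the confounding effect that stratifying results by agreement rate is meant to expose.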