A cornerstone of machine learning evaluation is the (often hidden) assumption that model and human responses are reliable enough to evaluate models against unitary, authoritative, ``gold standard'' data, via simple metrics such as accuracy, precision, and recall. The generative AI revolution would seem to explode this assumption, given the critical role stochastic inference plays. Yet, in spite of public demand for more transparency in AI -- along with strong evidence that humans are unreliable judges -- estimates of model reliability are conventionally based on, at most, a few output responses per input item. We adapt a method, previously used to evaluate the reliability of various metrics and estimators for machine learning evaluation, to determine whether an (existing or planned) dataset has enough responses per item to ensure reliable null hypothesis statistical testing. We show that, for many common metrics, collecting even 5--10 responses per item (from each model and team of human evaluators) is not sufficient. We apply our methods to several of the very few extant gold standard test sets with multiple disaggregated responses per item and show that even these datasets lack enough responses per item. We show how our methods can help AI researchers make better decisions about how to collect data for AI evaluation.
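To make the statistical question concrete, the following is a minimal Monte-Carlo sketch (not the paper's actual method) of how one might check whether a given number of responses per item supports a reliable paired test between two models. All numbers here -- item count, true per-response correctness rates, effect size -- are illustrative assumptions.

```python
import random
import math

def simulated_power(n_items=200, n_resp=5, p_a=0.75, p_b=0.70,
                    n_sims=300, seed=0):
    """Monte-Carlo power of a two-sided paired z-test comparing two
    models' item-level accuracies, when each model gives only n_resp
    stochastic responses per item.

    p_a, p_b are assumed true per-response correctness rates for
    models A and B (hypothetical values, not figures from the paper).
    Returns the fraction of simulated experiments in which the true
    gap p_a - p_b is detected at roughly alpha = 0.05.
    """
    rng = random.Random(seed)
    z_crit = 1.96  # two-sided 5% level, normal approximation
    rejections = 0
    for _ in range(n_sims):
        diffs = []
        for _ in range(n_items):
            # Per-item accuracy estimated from n_resp sampled responses.
            acc_a = sum(rng.random() < p_a for _ in range(n_resp)) / n_resp
            acc_b = sum(rng.random() < p_b for _ in range(n_resp)) / n_resp
            diffs.append(acc_a - acc_b)
        mean_d = sum(diffs) / n_items
        var_d = sum((d - mean_d) ** 2 for d in diffs) / (n_items - 1)
        se = math.sqrt(var_d / n_items)
        if se > 0 and abs(mean_d) / se > z_crit:
            rejections += 1
    return rejections / n_sims

# With few responses per item the test often misses a real 5-point gap;
# raising responses per item sharply increases power.
power_5 = simulated_power(n_resp=5)
power_50 = simulated_power(n_resp=50)
print(f"power with 5 responses/item:  {power_5:.2f}")
print(f"power with 50 responses/item: {power_50:.2f}")
```

Running the comparison for a planned dataset (with its own item count and plausible effect size) shows directly whether the budgeted number of responses per item yields acceptable power, or whether more responses must be collected.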