Forest vs Tree: The $(N, K)$ Trade-off in Reproducible ML Evaluation

Reproducibility is a cornerstone of scientific validation and of the authority that validated results carry. In machine learning evaluation, reproducibility leads to greater trust, confidence, and value. However, the ground-truth responses used in machine learning often necessarily come from humans, among whom disagreement is prevalent, and surprisingly little research has studied the impact of ignoring this disagreement, as is typically done. One reason for the lack of research is that budgets for collecting human-annotated evaluation data are limited, and collecting responses from multiple raters for each item greatly increases per-item annotation costs. We investigate the trade-off between the number of items ($N$) and the number of responses per item ($K$) needed for reliable machine learning evaluation. We analyze a diverse collection of categorical datasets for which multiple annotations per item exist, along with simulated distributions fit to these datasets, to determine the optimal $(N, K)$ configuration, given a fixed budget ($N \times K$), for collecting evaluation data and reliably comparing the performance of machine learning models. Our findings show, first, that human disagreement can be accounted for with a total budget $N \times K$ of no more than 1000 (and often much less) for every dataset tested, on at least one metric. Moreover, this minimal $N \times K$ almost always occurred at $K > 10$. Furthermore, the nature of the trade-off between $K$ and $N$, or whether one exists at all, depends on the evaluation metric: metrics that are more sensitive to the full distribution of responses perform better at higher values of $K$. Our methods can help ML practitioners obtain more effective test data by identifying the metrics, number of items, and number of annotations per item that yield the most reliability for their budget.
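To make the fixed-budget trade-off concrete, the following is a minimal toy simulation, not the procedure used in this work. It enumerates $(N, K)$ pairs with $N \times K$ equal to a given budget and estimates how often a more accurate model is correctly ranked above a less accurate one. Everything specific here is an assumption made for illustration: binary labels, per-item label probabilities drawn from a Beta(2, 2) distribution, models simulated as the true probability plus Gaussian noise, mean absolute error against the observed response fraction as the metric, and the function names `factor_pairs` and `simulate_win_rate`.

```python
import random


def factor_pairs(budget):
    """Return all (N, K) pairs of positive integers with N * K == budget."""
    return [(n, budget // n) for n in range(1, budget + 1) if budget % n == 0]


def simulate_win_rate(n_items, k_raters, trials=100,
                      noise_a=0.05, noise_b=0.15, seed=0):
    """Fraction of simulated evaluations in which model A beats model B.

    Assumptions (hypothetical, for illustration only):
      * each item has a latent probability p of receiving label 1,
        drawn from Beta(2, 2);
      * each of the K raters responds 1 with probability p;
      * each model predicts p plus Gaussian noise, clipped to [0, 1],
        with A less noisy than B;
      * the metric is the summed absolute error between a model's
        prediction and the observed fraction of 1-responses.
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        err_a = 0.0
        err_b = 0.0
        for _ in range(n_items):
            p = rng.betavariate(2, 2)
            votes = sum(1 for _ in range(k_raters) if rng.random() < p)
            observed = votes / k_raters
            pred_a = min(1.0, max(0.0, p + rng.gauss(0.0, noise_a)))
            pred_b = min(1.0, max(0.0, p + rng.gauss(0.0, noise_b)))
            err_a += abs(pred_a - observed)
            err_b += abs(pred_b - observed)
        if err_a < err_b:
            wins += 1
    return wins / trials


if __name__ == "__main__":
    budget = 1000
    # Only a few values of K are shown to keep the run short.
    for n, k in factor_pairs(budget):
        if k in (1, 5, 10, 20, 50):
            rate = simulate_win_rate(n, k)
            print(f"N={n:4d}  K={k:2d}  P(A ranked above B) ~ {rate:.2f}")
```

The printed win rates depend entirely on the assumed distributions and noise levels, so they should be read as a demonstration of how such a comparison can be set up rather than as evidence for any particular $(N, K)$ choice.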
