The Rashomon set is the set of models that perform approximately equally well on a given dataset, and the Rashomon ratio is the fraction of all models in a given hypothesis space that are in the Rashomon set. Rashomon ratios are often large for tabular datasets in criminal justice, healthcare, lending, education, and other areas, which has practical implications for whether simpler models can attain the same level of accuracy as more complex models. An open question is why Rashomon ratios tend to be large. In this work, we propose and study a mechanism of the data generation process, coupled with choices usually made by the analyst during the learning process, that determines the size of the Rashomon ratio. Specifically, we demonstrate that noisier datasets lead to larger Rashomon ratios through the way that practitioners train models. Additionally, we introduce a measure called pattern diversity, which captures the average difference in predictions between distinct classification patterns in the Rashomon set, and motivate why it tends to increase with label noise. Our results explain a key aspect of why simpler models often perform as well as black box models on complex, noisier datasets.
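To make the two central quantities concrete, the following is a minimal sketch rather than the paper's formal construction. It assumes a small finite hypothesis space of decision stumps counted uniformly, a Rashomon set defined as all models whose empirical 0-1 loss is within a tolerance `epsilon` of the best loss, and pattern diversity read as the mean normalized Hamming distance between distinct prediction patterns in that set. The function names, the synthetic data, and the value of `epsilon` are all illustrative assumptions.

```python
import numpy as np


def make_data(n=400, noise=0.1, seed=0):
    """Synthetic 2-feature binary data; each label is flipped with probability `noise`."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(-1.0, 1.0, size=(n, 2))
    y = (X[:, 0] > 0.0).astype(int)  # clean labels depend on feature 0 only
    flip = rng.random(n) < noise
    return X, np.where(flip, 1 - y, y)


def stump_predictions(X, n_thresholds=21):
    """Predictions of every stump (feature, threshold, polarity) on X.

    This finite grid of stumps is the assumed hypothesis space.
    """
    thresholds = np.linspace(-1.0, 1.0, n_thresholds)
    preds = []
    for j in range(X.shape[1]):
        for t in thresholds:
            p = (X[:, j] > t).astype(int)
            preds.append(p)      # predict 1 above the threshold
            preds.append(1 - p)  # opposite polarity
    return np.array(preds)  # shape: (n_models, n_samples)


def rashomon_ratio(preds, y, epsilon=0.05):
    """Fraction of models whose 0-1 loss is within `epsilon` of the best loss."""
    losses = (preds != y).mean(axis=1)
    in_set = losses <= losses.min() + epsilon
    return in_set.mean(), in_set


def pattern_diversity(preds, in_set):
    """Mean normalized Hamming distance between distinct prediction patterns
    in the Rashomon set (an assumed reading of the abstract's description)."""
    patterns = np.unique(preds[in_set], axis=0)
    k = len(patterns)
    if k < 2:
        return 0.0
    dist = (patterns[:, None, :] != patterns[None, :, :]).mean(axis=2)
    return dist.sum() / (k * (k - 1))  # average over ordered pairs i != j


X, y = make_data(noise=0.2)
preds = stump_predictions(X)
ratio, in_set = rashomon_ratio(preds, y, epsilon=0.05)
print("Rashomon ratio:", ratio)
print("Pattern diversity:", pattern_diversity(preds, in_set))
```

Rerunning the last lines with different `noise` values gives a feel for how both quantities move on this toy problem. Note that the abstract's claim concerns the effect of noise through the choices practitioners make during training, which this fixed hypothesis space does not model.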