Characterizing samples that are difficult to learn from is crucial to developing highly performant ML models. This has led to numerous Hardness Characterization Methods (HCMs) that aim to identify "hard" samples. However, there is a lack of consensus regarding the definition and evaluation of "hardness". Current HCMs have only been evaluated on specific types of hardness, and often only qualitatively or with respect to downstream performance, overlooking the fundamental quantitative identification task. We address this gap by presenting a fine-grained taxonomy of hardness types. Additionally, we propose the Hardness Characterization Analysis Toolkit (H-CAT), which supports comprehensive and quantitative benchmarking of HCMs across the hardness taxonomy and can easily be extended to new HCMs, hardness types, and datasets. We use H-CAT to evaluate 13 different HCMs across 8 hardness types. This comprehensive evaluation, encompassing over 14K setups, uncovers strengths and weaknesses of different HCMs, leading to practical tips to guide HCM selection and future development. Our findings highlight the need for more comprehensive HCM evaluation, and we hope our hardness taxonomy and toolkit will advance the principled evaluation and uptake of data-centric AI methods.