Recent advances in measuring the hardness-wise properties of data have guided sample selection for language models in low-resource scenarios. However, class-specific properties are overlooked during task setup and learning. How do these properties influence model learning, and are they generalizable across datasets? To answer these questions, this work formally introduces the concept of $\textit{class-wise hardness}$. Experiments across eight natural language understanding (NLU) datasets demonstrate a consistent hardness distribution across learning paradigms, models, and human judgment. Subsequent experiments reveal a notable challenge in measuring such class-wise hardness with the instance-level metrics of prior work. To address this, we propose $\textit{GeoHard}$, which measures class-wise hardness by modeling class geometry in the semantic embedding space. $\textit{GeoHard}$ surpasses instance-level metrics by over 59% in $\textit{Pearson}$'s correlation when measuring class-wise hardness. Our analysis theoretically and empirically underscores the generality of $\textit{GeoHard}$ as a fresh perspective on data diagnosis. Additionally, we showcase how understanding class-wise hardness can practically aid task learning.
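To make the idea of "modeling class geometry in the semantic embedding space" concrete, the sketch below computes a hypothetical class-wise hardness proxy from class centroids: intra-class dispersion divided by distance to the nearest other centroid. This is an illustrative assumption for intuition only, not the paper's actual GeoHard formula; the function name `class_geometry_hardness` and the synthetic 2-D embeddings are invented for the demo.

```python
import numpy as np

def class_geometry_hardness(embeddings, labels):
    """Hypothetical geometric hardness proxy (NOT the paper's GeoHard metric):
    for each class, divide intra-class dispersion (mean distance of its points
    to the class centroid) by the distance to the nearest other class centroid.
    Higher values suggest a geometrically less separable, i.e. "harder", class."""
    classes = np.unique(labels)
    centroids = {c: embeddings[labels == c].mean(axis=0) for c in classes}
    scores = {}
    for c in classes:
        pts = embeddings[labels == c]
        # average distance from class members to their own centroid
        intra = np.linalg.norm(pts - centroids[c], axis=1).mean()
        # distance to the closest centroid of any other class
        inter = min(np.linalg.norm(centroids[c] - centroids[o])
                    for o in classes if o != c)
        scores[c] = intra / inter
    return scores

# Toy demo with synthetic 2-D "embeddings" standing in for sentence embeddings.
rng = np.random.default_rng(0)
emb = np.vstack([rng.normal([0.0, 0.0], 0.5, (50, 2)),   # class 0 cluster
                 rng.normal([3.0, 0.0], 0.5, (50, 2))])  # class 1 cluster
lab = np.array([0] * 50 + [1] * 50)
print(class_geometry_hardness(emb, lab))
```

In practice one would replace the synthetic points with sentence embeddings of the class instances; the ratio then orders classes by how dispersed and how entangled with their neighbors they are.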