This paper proposes a method to estimate the class separability of an unlabeled text dataset by inspecting the topological characteristics of sentence-transformer embeddings of the text. Experiments cover both binary and multi-class cases, in balanced and imbalanced scenarios. The results show a clear correlation and better consistency between the proposed method and other separability and classification metrics, such as Thornton's method and the AUC score of a logistic regression classifier, as well as unsupervised methods. Finally, we show empirically that the proposed method can serve as part of a stopping criterion for fine-tuning language-model classifiers. By monitoring the class separability of the embedding space after each training iteration, we can detect, without using additional labels, when training stops improving the separability of the embeddings.
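The monitoring idea described above can be illustrated with a minimal sketch of a plateau check over per-iteration separability scores. This is not the paper's procedure: the function name `plateau_step`, the `patience` and `min_delta` parameters, and the convention that a higher score means better separability are all assumptions made for illustration. The separability score itself (computed from the topology of the embeddings) is treated as given.

```python
from typing import Optional, Sequence


def plateau_step(
    scores: Sequence[float],
    patience: int = 3,        # hypothetical: iterations to wait without improvement
    min_delta: float = 1e-3,  # hypothetical: smallest gain that counts as improvement
) -> Optional[int]:
    """Return the iteration index at which training would stop, or None.

    `scores` holds one label-free separability estimate per training
    iteration; higher is assumed to mean more separable embeddings.
    """
    best = float("-inf")
    stale = 0  # consecutive iterations without a meaningful improvement
    for step, score in enumerate(scores):
        if score > best + min_delta:
            best = score
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                return step
    return None


# Illustrative values only, not results from the paper.
example_scores = [0.50, 0.62, 0.70, 0.74, 0.741, 0.7405, 0.7408, 0.80]
stop_at = plateau_step(example_scores, patience=3, min_delta=0.005)
```

With these illustrative values, the score stops improving by more than `min_delta` after iteration 3, so the check fires three iterations later. In practice such a rule would likely be only one component of a stopping decision, in line with the paper's wording that the method "can be part of a stopping criterion".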