This paper introduces an unsupervised method for estimating the class separability of text datasets from a topological point of view. Using persistent homology, we demonstrate how tracking the evolution of embedding manifolds during training provides information about class separability. More specifically, we show how this technique can be applied to detect when the training process stops improving the separability of the embeddings. Our results, validated on binary and multi-class text classification tasks, show that the proposed method's estimates of class separability align with those obtained from supervised methods. This approach offers a novel perspective on monitoring and improving the fine-tuning of sentence transformers for classification tasks, particularly in scenarios where labeled data is scarce. We also discuss how tracking these topological quantities can provide additional insights into the properties of the trained classifier.
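As a rough illustration of the kind of label-free topological summary described above, the following minimal sketch computes the 0-dimensional persistence (connected-component death times) of an embedding point cloud. It relies on the standard fact that, for a Vietoris-Rips filtration in which an edge enters at its length, the finite H0 death times equal the edge lengths of a Euclidean minimum spanning tree. The function names `h0_death_times` and `gap_ratio`, and the gap-ratio statistic itself, are hypothetical choices made for this example; they are not the statistic or implementation used in the paper.

```python
import math
import statistics


def h0_death_times(points):
    """Sorted finite H0 death times of a Vietoris-Rips filtration.

    Computed as the edge lengths of a Euclidean minimum spanning tree
    (Prim's algorithm, O(n^2) time), assuming an edge enters the
    filtration at its length.
    """
    n = len(points)
    if n < 2:
        return []
    in_tree = [False] * n
    in_tree[0] = True
    # best[v] = shortest distance from point v to any point already in the tree
    best = [math.dist(points[0], p) for p in points]
    deaths = []
    for _ in range(n - 1):
        # Attach the closest point that is not yet in the tree.
        u = min((i for i in range(n) if not in_tree[i]), key=best.__getitem__)
        in_tree[u] = True
        deaths.append(best[u])
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v] = d
    return sorted(deaths)


def gap_ratio(deaths):
    """Hypothetical separability proxy: longest H0 lifetime over the median.

    A large value suggests at least one component that stays isolated
    far longer than typical points do (illustrative assumption only).
    """
    if not deaths:
        return 0.0
    med = statistics.median(deaths)
    return math.inf if med == 0 else deaths[-1] / med
```

In a monitoring loop, one could evaluate such a statistic on the embeddings after each epoch and watch for the point at which it stops changing; how closely this tracks true class separability is an assumption of the sketch, not a result it establishes.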