The success of many machine learning (ML) methods depends crucially on having large amounts of labeled data. However, obtaining enough labeled data can be expensive, time-consuming, and subject to ethical constraints in many applications. One approach that has shown tremendous value in addressing this challenge is semi-supervised learning (SSL), which uses both labeled and unlabeled data during training. The labeled set is typically much smaller than the unlabeled set, since unlabeled data is often relatively easy and inexpensive to obtain. SSL methods are therefore particularly useful in applications where labeling is especially costly, such as medical analysis, natural language processing (NLP), or speech recognition. One family of SSL methods that has achieved great success in various domains consists of algorithms that integrate graph-based techniques. These procedures are popular because of the rich information provided by the graphical framework and the versatility of their applications. In this work, we propose an algebraic topology-based semi-supervised method called persistent Laplacian-enhanced graph MBO (PL-MBO), which integrates persistent spectral graph theory with the classical Merriman-Bence-Osher (MBO) scheme. Specifically, we use a filtration procedure to generate a sequence of chain complexes and associated families of simplicial complexes, from which we construct a family of persistent Laplacians. The resulting procedure is efficient, requires much less labeled data to perform well than many ML techniques, and can be adapted to both small and large datasets. We evaluate the proposed method on data classification, and the results indicate that it outperforms existing semi-supervised algorithms.