Finding relevant, high-quality datasets to train machine learning models is a major bottleneck for practitioners. Moreover, ambitious real-world use cases usually require data labelled with high-quality annotations that support the training of a supervised model. Manually producing such labels is time-consuming and challenging, and it is often the limiting step in a machine learning project. Weakly Supervised Learning (WSL) approaches have been developed to alleviate the annotation burden by automatically assigning approximate labels (pseudo-labels) to unlabelled data based on heuristics, distant supervision and knowledge bases. We apply probabilistic generative latent variable models (PLVMs), trained on heuristic labelling representations of the original dataset, as an accurate, fast and cost-effective way to generate pseudo-labels. We show that the PLVMs achieve state-of-the-art performance across four datasets. For example, they achieve an F1 score 22 percentage points higher than Snorkel on the class-imbalanced Spouse dataset. PLVMs are plug-and-play: they can serve as a drop-in replacement for existing WSL frameworks (e.g. Snorkel) or as benchmark models for more complicated algorithms, giving practitioners a compelling accuracy boost.