Creating large, high-quality labeled datasets has become one of the major bottlenecks in developing machine learning applications. Multiple techniques have been developed either to decrease the dependence on labeled data (zero/few-shot learning, weak supervision) or to improve the efficiency of the labeling process (active learning). Among these, weak supervision has been shown to reduce labeling costs by employing hand-crafted labeling functions designed by domain experts. We propose AutoWS, a novel framework for increasing the efficiency of the weak supervision process while decreasing the dependency on domain experts. Our method requires only a small set of labeled examples per label class and automatically creates a set of labeling functions that assign noisy labels to a large pool of unlabeled data. These noisy labels can then be aggregated into probabilistic labels used by a downstream discriminative classifier. Our framework is fully automatic and requires no hyper-parameter specification by users. We compare our approach with state-of-the-art work on weak supervision and noisy training. Experimental results show that our method outperforms competitive baselines.
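To make the pipeline in the abstract concrete (a few labeled seeds per class, automatically generated labeling functions, noisy votes, aggregated probabilistic labels), here is a minimal sketch in Python with NumPy. It is not the AutoWS algorithm: the abstract does not say how labeling functions are built or aggregated, so the similarity-based functions, the `threshold` parameter, the normalized vote-count aggregation, and all function names below are illustrative assumptions.

```python
import numpy as np

ABSTAIN = -1  # conventional "no vote" value; an assumption of this sketch


def make_labeling_functions(X_seed, y_seed, threshold=0.8):
    """Build one labeling function per labeled seed example (hypothetical scheme).

    Each function votes for its seed's class when an input is close to the
    seed in cosine similarity, and abstains otherwise.
    """
    lfs = []
    for x, y in zip(X_seed, y_seed):
        proto = x / max(np.linalg.norm(x), 1e-12)

        def lf(X, proto=proto, label=int(y)):
            norms = np.maximum(np.linalg.norm(X, axis=1), 1e-12)
            sims = (X @ proto) / norms
            return np.where(sims >= threshold, label, ABSTAIN)

        lfs.append(lf)
    return lfs


def apply_lfs(lfs, X):
    """Return the noisy vote matrix of shape (n_samples, n_lfs)."""
    return np.stack([lf(X) for lf in lfs], axis=1)


def aggregate(votes, n_classes):
    """Turn votes into probabilistic labels via normalized vote counts.

    Rows where every function abstains fall back to a uniform distribution.
    """
    counts = np.stack(
        [(votes == c).sum(axis=1) for c in range(n_classes)], axis=1
    ).astype(float)
    totals = counts.sum(axis=1, keepdims=True)
    uniform = np.full_like(counts, 1.0 / n_classes)
    return np.where(totals > 0, counts / np.maximum(totals, 1.0), uniform)


# Toy data: one seed per class, three unlabeled points.
X_seed = np.array([[1.0, 0.0], [0.0, 1.0]])
y_seed = [0, 1]
X_unlabeled = np.array([[0.9, 0.1], [0.1, 0.9], [1.0, 1.0]])

lfs = make_labeling_functions(X_seed, y_seed)
votes = apply_lfs(lfs, X_unlabeled)        # [[0, -1], [-1, 1], [-1, -1]]
probs = aggregate(votes, n_classes=2)      # [[1, 0], [0, 1], [0.5, 0.5]]
```

The rows of `probs` could then serve as soft targets for any downstream discriminative classifier. In practice a learned label model that estimates the accuracy of each labeling function is usually preferred over plain vote counting, and the fixed `threshold` here is exactly the kind of user-set hyper-parameter the framework described above is meant to avoid.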