A cost-effective alternative to manual data labeling is weak supervision (WS), where data samples are automatically annotated using a predefined set of labeling functions (LFs): rule-based mechanisms that generate artificial labels for their associated classes. In this work, we investigate noise reduction techniques for WS based on the principle of k-fold cross-validation. We introduce a new algorithm, ULF (Unsupervised Labeling Function correction), which denoises WS data by leveraging models trained on all but some LFs to identify and correct biases specific to the held-out LFs. Specifically, ULF refines the allocation of LFs to classes by re-estimating this assignment on highly reliable cross-validated samples. Evaluation on multiple datasets confirms ULF's effectiveness in enhancing WS learning without the need for manual labeling.
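The idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`centroid_proba`, `refine_lf_assignment`, `toy_problem`), the nearest-centroid stand-in classifier, the fixed confidence `threshold`, the `blend` weight, and the toy data are all assumptions made for illustration. The sketch splits the LFs into k folds, trains on samples that no held-out LF matches, predicts the samples the held-out LFs do match, and re-estimates the LF-to-class matrix `T` from the confidently predicted samples.

```python
import numpy as np


def centroid_proba(X_train, y_train, X_test, n_classes):
    """Toy classifier: softmax over negative squared distances to class centroids."""
    dist = np.full((X_test.shape[0], n_classes), np.inf)
    for c in range(n_classes):
        members = X_train[y_train == c]
        if len(members):  # classes absent from the training split get probability 0
            dist[:, c] = ((X_test - members.mean(axis=0)) ** 2).sum(axis=1)
    p = np.exp(-(dist - dist.min(axis=1, keepdims=True)))
    return p / p.sum(axis=1, keepdims=True)


def refine_lf_assignment(X, Z, T, k, threshold=0.8, blend=0.8):
    """Re-estimate the LF-to-class matrix T via cross-validation over LFs.

    X: (n_samples, n_features) features; Z: (n_samples, n_lfs) 0/1 LF matches;
    T: (n_lfs, n_classes) current LF-to-class weights.
    """
    n_samples, n_lfs = Z.shape
    n_classes = T.shape[1]
    y_noisy = (Z @ T).argmax(axis=1)      # weak labels under the current assignment
    covered = Z.any(axis=1)               # samples matched by at least one LF
    proba_sum = np.zeros((n_samples, n_classes))
    n_preds = np.zeros(n_samples)

    for held_out in np.array_split(np.arange(n_lfs), k):
        test = Z[:, held_out].any(axis=1)  # samples matched by a held-out LF
        train = covered & ~test            # labeled samples untouched by held-out LFs
        if not test.any() or not train.any():
            continue
        proba_sum[test] += centroid_proba(X[train], y_noisy[train], X[test], n_classes)
        n_preds[test] += 1

    # Average out-of-sample predictions for samples that were tested in several folds.
    seen = n_preds > 0
    proba = np.zeros_like(proba_sum)
    proba[seen] = proba_sum[seen] / n_preds[seen, None]
    reliable = seen & (proba.max(axis=1) >= threshold)

    # Count, per LF, which classes the reliable predictions fall into.
    one_hot = np.eye(n_classes)[proba.argmax(axis=1)]
    counts = Z[reliable].T @ one_hot[reliable]
    totals = counts.sum(axis=1, keepdims=True)
    T_est = np.where(totals > 0, counts / np.maximum(totals, 1), T)
    return blend * T_est + (1 - blend) * T  # interpolate between new and old assignment


def toy_problem():
    """Two well-separated classes; LF 4 matches class-1 samples but is assigned to class 0."""
    idx = np.arange(12)
    X = np.vstack([
        np.column_stack([np.full(12, -2.0), idx / 100]),  # class 0
        np.column_stack([np.full(12, 2.0), idx / 100]),   # class 1
    ])
    y_true = np.repeat([0, 1], 12)
    even, odd, third = idx % 2 == 0, idx % 2 == 1, idx % 3 == 0
    none = np.zeros(12, dtype=bool)
    Z = np.column_stack([
        np.concatenate([even, none]),   # LF 0 -> class 0
        np.concatenate([odd, none]),    # LF 1 -> class 0
        np.concatenate([none, even]),   # LF 2 -> class 1
        np.concatenate([none, odd]),    # LF 3 -> class 1
        np.concatenate([none, third]),  # LF 4 -> class 0 (misassigned on purpose)
    ]).astype(int)
    T = np.array([[1, 0], [1, 0], [0, 1], [0, 1], [1, 0]], dtype=float)
    return X, Z, T, y_true


X, Z, T, y_true = toy_problem()
T_new = refine_lf_assignment(X, Z, T, k=5)  # k = number of LFs: hold out one LF per fold
acc_before = ((Z @ T).argmax(axis=1) == y_true).mean()
acc_after = ((Z @ T_new).argmax(axis=1) == y_true).mean()
print("LF 4 weights:", T[4], "->", T_new[4])
print("weak-label accuracy:", acc_before, "->", acc_after)
```

In this toy setting the misassigned LF is the only one whose row of `T` changes direction, because the models trained without it consistently place its samples in the other class. The `ground truth` array `y_true` is used only to report accuracy and never enters the re-estimation, in keeping with the unsupervised setting.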