To reduce human annotation effort, the programmatic weak supervision (PWS) paradigm abstracts weak supervision sources as labeling functions (LFs) and uses a label model to aggregate the outputs of multiple LFs into training labels. Most existing label models require a parameter learning step for each dataset. In this work, we present a hyper label model that, once learned, infers the ground-truth labels for each dataset in a single forward pass, without dataset-specific parameter learning. The hyper label model approximates an optimal analytical (yet computationally intractable) solution for the ground-truth labels. We train the model on synthetic data generated in a way that ensures the model approximates this analytical optimal solution, and we build the model on a graph neural network (GNN) so that its predictions are invariant (or equivariant) to permutations of the LFs (or data points). On 14 real-world datasets, our hyper label model outperforms the best existing methods in both accuracy (by 1.4 points on average) and efficiency (by six times on average). Our code is available at https://github.com/wurenzhi/hyper_label_model
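As a minimal illustration of the symmetry properties mentioned above, the sketch below aggregates LF outputs with a plain majority vote. This is a simple stand-in baseline, not the hyper label model itself. The `ABSTAIN = -1` convention, the tie-breaking rule, and the toy label matrix are assumptions made for the example only.

```python
from collections import Counter

ABSTAIN = -1  # assumed convention: an LF outputs -1 when it does not vote


def majority_vote(label_matrix):
    """Aggregate LF outputs row by row; ties break toward the smaller label."""
    labels = []
    for row in label_matrix:
        votes = Counter(v for v in row if v != ABSTAIN)
        if not votes:
            # No LF voted on this data point.
            labels.append(ABSTAIN)
            continue
        top = max(votes.values())
        labels.append(min(label for label, count in votes.items() if count == top))
    return labels


# Hypothetical label matrix: 4 data points (rows) x 3 LFs (columns).
X = [
    [1, 1, 0],
    [0, -1, 0],
    [-1, -1, 1],
    [1, 0, 0],
]

y = majority_vote(X)

# Reordering the LFs (columns) should leave the labels unchanged: invariance.
lf_perm = [2, 0, 1]
X_lf = [[row[j] for j in lf_perm] for row in X]

# Reordering the data points (rows) should reorder the labels the same way: equivariance.
pt_perm = [3, 1, 0, 2]
X_pt = [X[i] for i in pt_perm]
```

Any label model that treats LFs and data points as unordered sets should satisfy the same two checks: `majority_vote(X_lf)` equals `y`, and `majority_vote(X_pt)` equals `y` reordered by `pt_perm`.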