Neural sequence labeling (NSL) assigns a label to each input token and covers a broad range of applications, such as named entity recognition (NER) and slot filling. However, the strong results achieved by conventional supervised approaches depend heavily on large amounts of human-annotated data, which may not be feasible to obtain in real-world scenarios due to data privacy and computation efficiency issues. This paper presents SeqUST, a novel uncertainty-aware self-training framework for NSL that addresses labeled data scarcity and makes effective use of unlabeled data. Specifically, we incorporate Monte Carlo (MC) dropout in a Bayesian neural network (BNN) to estimate uncertainty at the token level, and then select reliable tokens from unlabeled data based on model confidence and certainty. A masked sequence labeling task with a noise-robust loss supports robust training by suppressing the effect of noisy pseudo labels. In addition, we develop a Gaussian-based consistency regularization technique that further improves model robustness on representations perturbed with Gaussian-distributed noise. This alleviates the over-fitting that arises from pseudo-labeled augmented data. Extensive experiments on six benchmarks demonstrate that SeqUST improves the performance of self-training and consistently outperforms strong baselines by a large margin in low-resource scenarios.
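The token-level selection step described in the abstract can be illustrated with a minimal NumPy sketch. This is not the SeqUST implementation: the toy linear tagger, the function names `mc_dropout_probs` and `select_reliable_tokens`, the thresholds, and the use of mutual information (BALD) as the uncertainty score are all assumptions made for illustration, since the abstract does not specify the exact selection criterion.

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def mc_dropout_probs(features, weights, n_passes=20, drop_rate=0.3, seed=0):
    """Run several stochastic forward passes with dropout kept active.

    Hypothetical toy tagger: one linear layer over per-token features.
    Returns an array of shape (n_passes, n_tokens, n_classes).
    """
    rng = np.random.default_rng(seed)
    passes = []
    for _ in range(n_passes):
        # Sample a fresh dropout mask for every pass (inverted dropout scaling).
        mask = rng.random(features.shape) >= drop_rate
        dropped = features * mask / (1.0 - drop_rate)
        passes.append(softmax(dropped @ weights))
    return np.stack(passes)


def select_reliable_tokens(probs, min_confidence=0.8, max_uncertainty=0.05):
    """Pick tokens that are both confident and certain across MC passes.

    probs: (n_passes, n_tokens, n_classes) class probabilities.
    Confidence is the max of the mean distribution; uncertainty is the
    mutual information between predictions and dropout masks (assumed
    criterion, not necessarily the one used by SeqUST).
    """
    eps = 1e-12
    mean_probs = probs.mean(axis=0)
    pseudo_labels = mean_probs.argmax(axis=-1)
    confidence = mean_probs.max(axis=-1)
    # Entropy of the averaged prediction.
    predictive_entropy = -(mean_probs * np.log(mean_probs + eps)).sum(axis=-1)
    # Average entropy of the individual passes.
    expected_entropy = -(probs * np.log(probs + eps)).sum(axis=-1).mean(axis=0)
    uncertainty = predictive_entropy - expected_entropy
    keep = (confidence >= min_confidence) & (uncertainty <= max_uncertainty)
    return pseudo_labels, confidence, uncertainty, keep
```

In a self-training loop, only the tokens where `keep` is true would contribute pseudo labels to the next training round; the remaining tokens would be masked out of the loss.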