Language model pre-training has proven useful in many language understanding tasks. In this paper, we investigate whether adding self-training to the pre-training and fine-tuning steps is still helpful. Towards this goal, we propose a learning framework that makes the best use of unlabeled data for both low-resource and high-resource labeled datasets. In industrial NLP applications, users or customers produce large amounts of data, and our learning framework builds on this large pool of unlabeled data. First, we use a model fine-tuned on the manually labeled dataset to predict pseudo labels for the user-generated unlabeled data. Then we use the pseudo labels to supervise task-specific training on the large amount of user-generated data. We treat this task-specific training on pseudo labels as a pre-training step for the subsequent fine-tuning step. Finally, we fine-tune the resulting pre-trained model on the manually labeled dataset. We first show empirically that our method reliably improves performance by 3.6% when the manually labeled fine-tuning dataset is relatively small. We then show that our method still improves performance by a further 0.2% when the manually labeled fine-tuning dataset is relatively large. We argue that our method makes the best use of the unlabeled data and is superior to either pre-training or self-training alone.
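The staged procedure described above (fine-tune a teacher, predict pseudo labels, train on the pseudo labels, fine-tune again on the manual labels) can be sketched in a few lines. This is only an illustrative toy under stated assumptions, not the implementation used in this work: a NumPy logistic regression stands in for the pre-trained language model, the data is synthetic, and the names `train_logreg`, `predict`, and `make_data`, as well as all sizes and hyperparameters, are hypothetical.

```python
import numpy as np


def train_logreg(X, y, w=None, epochs=200, lr=0.1):
    """Gradient-descent logistic regression (toy stand-in for a language model).

    Passing `w` resumes training from existing weights, which is how this
    sketch mimics "fine-tuning upon a pre-trained model".
    """
    if w is None:
        w = np.zeros(X.shape[1])
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))   # predicted probabilities
        w = w - lr * X.T @ (p - y) / len(y)  # gradient step on the log loss
    return w


def predict(X, w):
    """Hard 0/1 labels from the linear decision rule."""
    return (X @ w > 0).astype(float)


# Synthetic data (assumption): labels come from a fixed linear rule.
rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0, 0.5])


def make_data(n):
    X = rng.normal(size=(n, 3))
    y = (X @ true_w > 0).astype(float)
    return X, y


X_labeled, y_labeled = make_data(20)   # small manually labeled set
X_unlabeled, _ = make_data(2000)       # large user-generated set, labels discarded
X_test, y_test = make_data(500)        # held-out evaluation set

# Step 1: fine-tune a teacher on the manually labeled data.
w_teacher = train_logreg(X_labeled, y_labeled)

# Step 2: predict pseudo labels for the unlabeled data.
pseudo_labels = predict(X_unlabeled, w_teacher)

# Step 3: task-specific training on the pseudo labels (the "pre-training" step).
w_pretrained = train_logreg(X_unlabeled, pseudo_labels)

# Step 4: fine-tune on the manually labeled data, starting from step 3's weights.
w_final = train_logreg(X_labeled, y_labeled, w=w_pretrained)

accuracy = float((predict(X_test, w_final) == y_test).mean())
print(f"test accuracy after the final fine-tuning step: {accuracy:.3f}")
```

In this toy setting the printed accuracy says nothing about the gains reported above; the sketch only shows the order of the steps and that the final fine-tuning starts from the weights learned on pseudo labels rather than from scratch.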