In recent years, pretrained neural language models (PNLMs) have taken the field of natural language processing (NLP) by storm, setting new state-of-the-art results across many benchmarks. These models often rely heavily on annotated data, which may not always be available. Data scarcity is common in specialized domains, such as medicine, and in low-resource languages that are underexplored by AI research. In this dissertation, we focus on mitigating data scarcity for neural language models using data augmentation and neural ensemble learning. In both research directions, we implement neural network algorithms and evaluate how much they help neural language models on downstream NLP tasks. Specifically, for data augmentation, we explore two techniques: 1) creating positive training data by moving an answer span around its original context and 2) using text simplification techniques to introduce a variety of writing styles into the original training data. Our results indicate that these simple and effective solutions considerably improve the performance of neural language models in low-resource NLP domains and tasks. For neural ensemble learning, we use a multilabel neural classifier to select the best prediction from a set of individual pretrained neural language models trained for a low-resource medical text simplification task.
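The first augmentation technique, moving an answer span around its original context, can be illustrated with a minimal sketch. The abstract does not specify the procedure, so the sketch below rests on assumptions: the context is already split into sentences, the answer lies within one sentence, and "moving" means relocating that sentence to every other position while recomputing the answer's character offset. The function name `move_answer_span` and the output keys (`context`, `answer`, `answer_start`) are hypothetical and are not taken from the dissertation.

```python
from typing import Dict, List


def move_answer_span(
    sentences: List[str], answer: str, answer_idx: int
) -> List[Dict[str, object]]:
    """Create extra positive QA examples by relocating the answer sentence.

    Assumption (not from the source): the answer occurs inside
    sentences[answer_idx], and sentences are joined with single spaces.
    """
    answer_sentence = sentences[answer_idx]
    if answer not in answer_sentence:
        raise ValueError("answer must occur in the sentence at answer_idx")

    # All sentences except the one that carries the answer.
    others = sentences[:answer_idx] + sentences[answer_idx + 1:]

    examples = []
    for pos in range(len(others) + 1):
        if pos == answer_idx:
            continue  # this position reproduces the original context
        reordered = others[:pos] + [answer_sentence] + others[pos:]
        context = " ".join(reordered)
        # Characters before the answer sentence: each earlier sentence
        # plus the single space that follows it.
        sentence_offset = sum(len(s) + 1 for s in reordered[:pos])
        start = sentence_offset + answer_sentence.index(answer)
        examples.append(
            {"context": context, "answer": answer, "answer_start": start}
        )
    return examples


if __name__ == "__main__":
    demo = move_answer_span(
        [
            "Aspirin reduces fever.",
            "It was first synthesized in 1897.",
            "It is widely available.",
        ],
        answer="1897",
        answer_idx=1,
    )
    for ex in demo:
        print(ex["answer_start"], ex["context"])
```

Each generated example keeps the same question-answer pair but places the answer at a different position, which is one plausible way to reduce a model's reliance on where answers tend to appear in the training data.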