Learning effective sentence representations is crucial for many Natural Language Processing (NLP) tasks, including semantic search, semantic textual similarity (STS), and clustering. Although many transformer models have been developed for learning sentence embeddings, they may not perform optimally in specialized domains such as aviation, whose text is characterized by technical jargon, abbreviations, and unconventional grammar. Furthermore, the absence of labeled datasets makes it difficult to train models specifically for the aviation domain. To address these challenges, we propose a novel approach for adapting sentence transformers to the aviation domain. Our method is a two-stage process consisting of pre-training followed by fine-tuning. During pre-training, we use the Transformer-based Sequential Denoising Auto-Encoder (TSDAE) with aviation text data as input to improve the initial model performance. Subsequently, we fine-tune our models on a Natural Language Inference (NLI) dataset in the Sentence Bidirectional Encoder Representations from Transformers (SBERT) architecture to mitigate overfitting. Experimental results on several downstream tasks show that our adapted sentence transformers significantly outperform general-purpose transformers, demonstrating the effectiveness of our approach in capturing the nuances of the aviation domain. Overall, our work highlights the importance of domain-specific adaptation in developing high-quality NLP solutions for specialized industries like aviation.
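To make the pre-training stage concrete, the following is a minimal sketch of the input corruption step on which a denoising auto-encoder objective such as TSDAE relies: each unlabeled sentence is paired with a noisy copy, and the model is trained to reconstruct the original from the encoding of the noisy copy. The sketch makes several assumptions that are not stated in this abstract: noise is applied by token deletion, the deletion ratio is 0.6, and tokens are obtained by whitespace splitting. The function names `delete_tokens` and `make_denoising_pairs` are hypothetical and chosen only for illustration.

```python
import random


def delete_tokens(sentence, del_ratio=0.6, rng=None):
    """Return a copy of `sentence` with each token dropped with probability `del_ratio`.

    Assumption: whitespace tokenization and deletion noise; the ratio 0.6 is an
    assumed default, not a value taken from this abstract.
    """
    rng = rng or random.Random()
    tokens = sentence.split()
    if not tokens:
        return sentence
    # Keep a token only when the random draw falls at or above the deletion ratio.
    kept = [tok for tok in tokens if rng.random() >= del_ratio]
    # Never return an empty input: fall back to one randomly chosen token.
    if not kept:
        kept = [rng.choice(tokens)]
    return " ".join(kept)


def make_denoising_pairs(sentences, del_ratio=0.6, seed=0):
    """Build (noisy_input, original_target) pairs from unlabeled sentences."""
    rng = random.Random(seed)
    return [(delete_tokens(s, del_ratio, rng), s) for s in sentences]
```

Calling `make_denoising_pairs` on a list of unlabeled aviation sentences yields the training pairs for the reconstruction objective; no labels are needed, which is why this stage suits a domain without annotated data. The encoder, decoder, and the subsequent NLI fine-tuning stage are deliberately left out of this sketch.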