Learning effective sentence representations is crucial for many Natural Language Processing (NLP) tasks, including semantic search, semantic textual similarity (STS), and clustering. Although many transformer models have been developed for sentence embedding learning, they may not perform optimally in specialized domains such as aviation, whose text is characterized by technical jargon, abbreviations, and unconventional grammar. Furthermore, the absence of labeled datasets makes it difficult to train models specifically for the aviation domain. To address these challenges, we propose a novel approach for adapting sentence transformers to the aviation domain. Our method is a two-stage process consisting of pre-training followed by fine-tuning. During pre-training, we apply the Transformer-based Sequential Denoising Auto-Encoder (TSDAE) to unlabeled aviation text to improve the initial model performance. Subsequently, we fine-tune the models on a Natural Language Inference (NLI) dataset using the Sentence Bidirectional Encoder Representations from Transformers (SBERT) architecture to mitigate overfitting. Experimental results on several downstream tasks show that our adapted sentence transformers significantly outperform general-purpose transformers, demonstrating the effectiveness of our approach in capturing the nuances of the aviation domain. Overall, our work highlights the importance of domain-specific adaptation in developing high-quality NLP solutions for specialized industries such as aviation.
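To make the denoising step of the pre-training stage concrete, the following is a minimal sketch of how training pairs for a denoising auto-encoder such as TSDAE might be built. It assumes token-deletion noise with a ratio of 0.6, a common choice for TSDAE; the abstract itself does not state the noise type or ratio. The function names `add_deletion_noise` and `make_denoising_pairs` and the example sentences are hypothetical and serve only as illustration. The sketch covers input corruption only; in TSDAE, an encoder then embeds the corrupted sentence and a decoder is trained to reconstruct the original from that embedding.

```python
import random


def add_deletion_noise(sentence, del_ratio=0.6, rng=None):
    """Return a copy of `sentence` with each token dropped with probability `del_ratio`.

    Assumption: whitespace tokenization and deletion noise; the abstract
    does not specify either.
    """
    rng = rng or random.Random()
    tokens = sentence.split()
    if not tokens:
        return sentence
    kept = [tok for tok in tokens if rng.random() >= del_ratio]
    if not kept:
        # Keep at least one token so the encoder never receives an empty input.
        kept = [rng.choice(tokens)]
    return " ".join(kept)


def make_denoising_pairs(sentences, del_ratio=0.6, seed=0):
    """Build (noisy_input, reconstruction_target) pairs from unlabeled sentences."""
    rng = random.Random(seed)
    return [(add_deletion_noise(s, del_ratio, rng), s) for s in sentences]


# Hypothetical aviation-style sentences, invented for illustration only.
corpus = [
    "ACFT returned to gate after ENG 2 oil pressure warning during taxi",
    "Crew reported TCAS RA on descent and complied with the advisory",
]

pairs = make_denoising_pairs(corpus)
for noisy, original in pairs:
    print(f"{noisy!r} -> {original!r}")
```

Because the pairs are derived from raw text alone, this stage needs no labels, which is what makes it suitable when annotated aviation data is unavailable. The labeled NLI data enters only in the second, fine-tuning stage.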