Disease name normalization is an important task in the medical domain. It classifies disease names written in various formats into standardized names, and it serves as a fundamental component of smart healthcare systems that support disease-related functions. The most significant obstacle for existing disease name normalization systems, however, is the severe shortage of training data. While data augmentation is a powerful approach for addressing data scarcity, our findings reveal that conventional data augmentation techniques often degrade task performance, primarily because of the multi-axis and multi-granularity nature of disease names. Consequently, we introduce a set of customized data augmentation techniques designed to leverage the semantic information inherent in disease names. These techniques aim to enhance the model's understanding of the semantic intricacies and classification structure of disease names. Through extensive experimentation, we demonstrate that our proposed plug-and-play methods not only surpass general data augmentation techniques but also yield significant performance improvements across various baseline models and training objectives, particularly in scenarios with limited training data. This underscores their potential for widespread application in medical language processing tasks.