Disease name normalization is an important task in the medical domain. It maps disease names written in various formats to standardized names, serving as a fundamental component of smart healthcare systems that support disease-related functions. However, the most significant obstacle for existing disease name normalization systems is a severe shortage of training data. To mitigate this problem, we present a novel data augmentation approach that combines a series of data augmentation techniques with several supporting modules. Our methods rely on two properties: the Structural Invariance property of disease names and the Hierarchy property of the disease classification system. The goal is to give models a thorough understanding of both the disease names and the hierarchical structure of the disease classification system. Through extensive experiments, we show that the proposed approach yields significant performance improvements across various baseline models and training objectives, particularly in scenarios with limited training data.