Deep learning approaches perform well on a wide range of text tasks. However, they still struggle with medical text classification, where samples are often extremely imbalanced and scarce. Unlike mainstream approaches, which supplement semantics with external medical information, this paper rethinks the data challenges in medical texts and presents Text2Tree, a novel framework-agnostic algorithm that uses only the internal label hierarchy when training deep learning models. We embed the ICD code tree structure of the labels into cascade attention modules to learn hierarchy-aware label representations. We devise two new learning schemes that follow the label representation hierarchy: Similarity Surrogate Learning (SSL), which reuses samples of other labels, and Dissimilarity Mixup Learning (DML), which distinguishes samples of other labels. Experiments on authoritative public datasets and real-world medical records show that our approach consistently outperforms classical and advanced imbalanced classification methods.
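To make the idea of a label hierarchy more concrete, the following is a minimal sketch and not the Text2Tree implementation. The abstract gives no formulas, so everything here is an assumption made for illustration: the tiny `PARENT` tree, the use of edge-count tree distance as a stand-in for label (dis)similarity, and the helper names `path_to_root`, `tree_distance`, `most_distant_label`, and `mixup`. It only shows how labels that share an ancestor can be treated as close, and how features from a sample with a distant label could be interpolated with standard mixup.

```python
# Hypothetical toy hierarchy: child code -> parent code (None marks the root).
# The codes are only illustrative stand-ins for an ICD-style code tree.
PARENT = {
    "ROOT": None,
    "E00-E90": "ROOT",
    "I00-I99": "ROOT",
    "E10": "E00-E90",
    "E11": "E00-E90",
    "I10": "I00-I99",
}


def path_to_root(code):
    """Return the list of codes from `code` up to the root, inclusive."""
    path = []
    while code is not None:
        path.append(code)
        code = PARENT[code]
    return path


def tree_distance(a, b):
    """Number of edges between two codes in the toy tree."""
    path_a, path_b = path_to_root(a), path_to_root(b)
    ancestors_b = set(path_b)
    # Climb from `a` until reaching the lowest common ancestor of `a` and `b`.
    for steps_a, node in enumerate(path_a):
        if node in ancestors_b:
            return steps_a + path_b.index(node)
    raise ValueError("codes do not share a root")


def most_distant_label(anchor, candidates):
    """Pick the candidate label farthest from `anchor` in the tree (assumed rule)."""
    return max(candidates, key=lambda c: tree_distance(anchor, c))


def mixup(x_a, x_b, lam):
    """Standard mixup interpolation of two equal-length feature vectors."""
    return [lam * u + (1.0 - lam) * v for u, v in zip(x_a, x_b)]


if __name__ == "__main__":
    print(tree_distance("E10", "E11"))  # siblings under the same block
    print(tree_distance("E10", "I10"))  # labels in different blocks
    partner = most_distant_label("E10", ["E11", "I10"])
    print(partner, mixup([1.0, 0.0], [0.0, 1.0], lam=0.25))
```

In this sketch, sibling codes end up closer than codes from different blocks, which is the kind of signal a hierarchy-aware method could exploit. How Text2Tree actually derives similarity from its learned label representations is not stated in the abstract.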