Applying unsupervised keyphrase methods on concepts extracted from discharge sheets

Clinical notes contain valuable patient information, but they are written by health care providers with varying levels of scientific training and different writing styles. When dealing with extensive electronic medical records, it helps clinicians and researchers to know which information is essential. Recognizing entities and mapping them to standard terminologies is crucial for reducing ambiguity when processing clinical notes. Although named entity recognition and entity linking are critical steps in clinical natural language processing, they can also produce repetitive and low-value concepts. On the other hand, not all parts of a clinical text are equally important or informative for predicting the patient's condition. As a result, it is necessary both to identify the section in which each piece of content is recorded and to identify key concepts in order to extract meaning from clinical texts. In this study, we address these challenges using clinical natural language processing techniques. In addition, to identify key concepts, we examine and evaluate a set of popular unsupervised keyphrase extraction methods. Because most clinical concepts are multi-word expressions, and identifying them accurately requires the user to specify an n-gram range, we propose a shortcut method based on TF-IDF that preserves the structure of each expression. To evaluate the pre-processing method and the concept selection, we designed two downstream tasks (multi-class and binary classification) using transformer-based models. The results show the superiority of the proposed method in combination with the SciBERT model and offer insight into the efficacy of general keyphrase extraction methods on clinical notes.
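To make the TF-IDF idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes that each note has already been reduced to a list of linked concepts, and it treats every concept string as a single atomic term, so a multi-word expression such as "acute kidney injury" is scored as a whole and no n-gram range has to be chosen. The function name `rank_concepts`, the toy notes, and the smoothed IDF formula are all illustrative assumptions.

```python
import math
from collections import Counter


def rank_concepts(docs, top_k=3):
    """Rank the concepts of each note by TF-IDF.

    docs: list of notes, each a list of concept strings (hypothetical input
    format; a concept may be a multi-word expression and is never split).
    Returns, for each note, its top_k concepts in descending score order.
    """
    n_docs = len(docs)

    # Document frequency: number of notes in which each concept occurs.
    df = Counter()
    for concepts in docs:
        df.update(set(concepts))

    ranked = []
    for concepts in docs:
        tf = Counter(concepts)
        total = len(concepts)
        # Relative term frequency times a smoothed IDF (an assumed variant).
        scores = {
            concept: (count / total)
            * (math.log((1 + n_docs) / (1 + df[concept])) + 1)
            for concept, count in tf.items()
        }
        # Sort by score, breaking ties alphabetically for determinism.
        top = sorted(scores, key=lambda c: (-scores[c], c))[:top_k]
        ranked.append(top)
    return ranked


# Toy example: concepts are made up for illustration only.
notes = [
    ["chest pain", "hypertension", "chest pain", "aspirin"],
    ["hypertension", "type 2 diabetes mellitus", "metformin"],
    ["hypertension", "acute kidney injury", "acute kidney injury"],
]

for top in rank_concepts(notes, top_k=2):
    print(top)
```

In this toy data, "hypertension" appears in every note and is therefore down-weighted, while concepts that are frequent within one note but rare across notes rise to the top with their wording intact.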
