Clinical information systems have become large repositories of semi-structured, partly annotated electronic health record data, which have reached a critical mass that makes them attractive for supervised, data-driven neural network approaches. We explored automated coding of 50-character-long clinical problem list entries with the International Classification of Diseases (ICD-10) and evaluated three types of network architecture on the top 100 ICD-10 three-digit codes. A fastText baseline reached a macro-averaged F1-score of 0.83, followed by a character-level LSTM with a macro-averaged F1-score of 0.84. The top-performing approach used a RoBERTa model fine-tuned on the downstream coding task from a custom language model, yielding a macro-averaged F1-score of 0.88. A neural network activation analysis, together with an investigation of the false positives and false negatives, revealed inconsistent manual coding as a main limiting factor.
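For readers less familiar with the metric reported above, the following is a minimal sketch of a macro-averaged F1-score, assuming a single-label setting in which each problem list entry receives exactly one code. The function name `macro_f1` and the example codes are illustrative assumptions and are not taken from the study's data or implementation.

```python
from collections import defaultdict


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in y_true or y_pred."""
    tp = defaultdict(int)  # true positives per class
    fp = defaultdict(int)  # false positives per class
    fn = defaultdict(int)  # false negatives per class

    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    labels = set(y_true) | set(y_pred)
    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)

    # Every class counts equally, regardless of how often it occurs.
    return sum(scores) / len(scores)


# Hypothetical gold and predicted three-digit codes for four entries.
gold = ["I10", "I10", "E11", "J18"]
predicted = ["I10", "E11", "E11", "J18"]
score = macro_f1(gold, predicted)
```

Because each class contributes equally to the mean, rare codes influence the score as much as frequent ones, which is why macro averaging is a common choice when code frequencies are skewed.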