Introduction: Clinical text classification with natural language processing (NLP) models requires adequate training data to achieve optimal performance. In practice, 200-500 documents are typically annotated, a number constrained by time and cost rather than by any principled justification of sample-size requirements or of their relationship to the vocabulary properties of the text. Methods: Using the publicly available MIMIC-III dataset of hospital discharge notes labeled with ICD-9 diagnoses, we trained Random Forest classifiers on pre-trained BERT embeddings to identify 10 randomly selected diagnoses, varied the training corpus size from 100 to 10,000 documents, and analyzed vocabulary properties by identifying strong and noisy predictive words through Lasso logistic regression on bag-of-words representations. Results: Learning curves varied substantially across the 10 classification tasks despite identical preprocessing and algorithms; for all tasks, 600 documents were sufficient to reach 95% of the performance attainable with 10,000 documents. Vocabulary analysis showed that more strong predictors and fewer noisy predictors were associated with steeper learning curves: every 100 additional noisy words decreased accuracy by approximately 0.02, while every 100 additional strong predictors increased maximum accuracy by approximately 0.04.
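The learning-curve experiment described in the Methods can be sketched as follows. This is a minimal, hypothetical illustration: random vectors stand in for the pre-trained BERT document embeddings so the sketch stays self-contained, the signal strength and sizes are invented, and only the overall shape of the procedure (train a Random Forest at increasing corpus sizes, then find the smallest size reaching 95% of the best observed accuracy) mirrors the study.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

def make_corpus(n_docs, dim=32):
    """Synthetic stand-in for BERT document embeddings with a binary
    diagnosis label (a weak signal is mixed into the features)."""
    y = rng.integers(0, 2, size=n_docs)
    X = rng.normal(size=(n_docs, dim)) + 0.5 * y[:, None]
    return X, y

def learning_curve(sizes, X_test, y_test):
    """Train a Random Forest at each training-corpus size and return
    the held-out accuracy for each size."""
    accs = []
    for n in sizes:
        X_tr, y_tr = make_corpus(n)
        clf = RandomForestClassifier(n_estimators=100, random_state=0)
        clf.fit(X_tr, y_tr)
        accs.append(accuracy_score(y_test, clf.predict(X_test)))
    return accs

X_te, y_te = make_corpus(2000)
sizes = [100, 300, 600, 1000, 3000]
accs = learning_curve(sizes, X_te, y_te)

# Smallest corpus size reaching 95% of the best observed accuracy,
# analogous to the 600-document threshold reported in the Results.
enough = next(n for n, a in zip(sizes, accs) if a >= 0.95 * max(accs))
```

In the actual study each of the 10 diagnoses yields its own curve; repeating this loop per label is what exposes the task-to-task variation the abstract reports.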