We introduce an in-domain supervised pipeline to counter the out-of-distribution performance drop that affects supervised biomedical NLP models, a drop observed when models trained on pathology reports are transferred across cancer registries. Our contribution is a reproducible recipe for training a supervised classifier from routinely collected cancer registry data. The recipe describes how to build the in-domain training set and a production-matched holdout, and how to choose operating points that keep the false-negative rate (FNR) very low while keeping reviewer workload manageable. The pipeline standardizes data curation through facility-stratified sampling and separate handling of reports linked to registry cases, and it includes a blinded manual audit to estimate positive-case prevalence and label noise. On a 418k-report holdout set, the Kentucky model achieved an FNR of 0.003 and a false-positive rate (FPR) of 0.097, improving on the Seattle-trained MOSSAIC OncoID baseline (FNR 0.010, FPR 0.183) and raising F1 from 0.860 to 0.922. In a blinded manual review of 600 reports, the estimated positive prevalence declined from 0.500 to 0.398, indicating substantial label noise, with errors concentrated in rare primary sites.
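The abstract mentions choosing operating points that keep the FNR very low, but it does not state the selection procedure. As a minimal sketch of one common approach (not necessarily the authors' method), the hypothetical function `pick_threshold` below takes classifier scores and binary labels from a validation set, picks the highest score threshold whose FNR stays at or below a target, and reports the resulting FNR and FPR. The function name, the `max_fnr` parameter, and the toy data are assumptions for illustration only.

```python
import numpy as np

def pick_threshold(scores, labels, max_fnr=0.003):
    """Hypothetical helper: highest threshold with FNR <= max_fnr.

    A report is predicted positive when score >= threshold.
    Returns (threshold, fnr, fpr) measured on the data passed in.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    if not 0.0 <= max_fnr < 1.0:
        raise ValueError("max_fnr must be in [0, 1)")
    if labels.sum() == 0:
        raise ValueError("need at least one positive example")

    # Sort the scores of the true positives in ascending order.
    pos_scores = np.sort(scores[labels])
    # Number of positives we are allowed to miss.
    allowed_misses = int(np.floor(max_fnr * pos_scores.size))
    # Thresholding at this score misses at most `allowed_misses` positives.
    threshold = pos_scores[allowed_misses]

    predicted = scores >= threshold
    fnr = float(np.mean(~predicted[labels]))   # missed positives / positives
    fpr = float(np.mean(predicted[~labels]))   # flagged negatives / negatives
    return float(threshold), fnr, fpr

# Toy data, invented for illustration only.
toy_scores = [0.1, 0.4, 0.35, 0.8, 0.9, 0.2, 0.7, 0.6]
toy_labels = [0, 0, 1, 1, 1, 0, 1, 0]
print(pick_threshold(toy_scores, toy_labels, max_fnr=0.25))
```

Lowering `max_fnr` moves the threshold down, which flags more negatives and so increases reviewer workload. In practice the threshold would be chosen on a validation split and then evaluated on the separate production-matched holdout.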