Clinical trials (CTs) often fail because of inadequate patient recruitment. This paper addresses the challenges of CT retrieval under the patient-to-trials paradigm, in which a patient description is the query and trials are the documents to be retrieved. Our pipeline-based approach has two key components: (i) a data enrichment technique that enhances both queries and documents in the first retrieval stage, and (ii) a novel re-ranking schema that uses a Transformer network in a setup adapted to this task by exploiting the structure of CT documents. We apply named entity recognition and negation detection to both the patient descriptions and the eligibility sections of CTs. We further classify the information in patient descriptions and CT eligibility criteria into current, past, and family medical conditions. The extracted information is used to boost the importance of disease and drug mentions in both the query and the index for lexical retrieval. Furthermore, we propose a two-step training schema for the Transformer network that re-ranks the results of the lexical retrieval. The first step focuses on matching patient information with the descriptive sections of trials, while the second step aims to determine eligibility by matching patient information with the criteria section. Our findings indicate that the inclusion criteria section of a CT strongly influences the relevance score in lexical models, and that enriching queries and documents improves the retrieval of relevant trials. The re-ranking strategy, based on our training schema, consistently enhances CT retrieval and improves precision at retrieving eligible trials by 15%. The results of our experiments suggest the benefit of making use of extracted entities. Moreover, our proposed re-ranking schema shows promising effectiveness compared to larger neural models, even with limited training data.
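To make the enrichment idea concrete, the following is a minimal sketch of lexical retrieval in which disease and drug mentions in the patient query receive a higher weight. It is an illustration under stated assumptions, not the implementation used in this work: the function names `enrich_query` and `bm25_scores`, the boost factor of 2.0, the whitespace tokenization, the toy trials, and the choice to leave negated mentions unboosted are all hypothetical. The entity and negation sets are supplied by hand here, whereas in the pipeline they would come from named entity recognition and negation detection. The scoring function is a standard BM25 variant with a per-term query weight.

```python
import math
from collections import Counter


def enrich_query(tokens, entities, negated=frozenset(), boost=2.0):
    """Map each query term to a weight.

    Assumption: affirmed disease/drug mentions get `boost`;
    negated mentions and all other terms keep weight 1.0.
    """
    weights = {}
    for tok in tokens:
        is_affirmed_entity = tok in entities and tok not in negated
        weights[tok] = boost if is_affirmed_entity else 1.0
    return weights


def bm25_scores(query_weights, docs, k1=1.2, b=0.75):
    """Score tokenized documents with BM25, scaling each term by its query weight."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term, weight in query_weights.items():
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += weight * idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy eligibility texts and a toy patient description (hypothetical data).
trials = [
    "adults with diabetes receiving metformin".split(),
    "adults with hypertension and no prior stroke".split(),
    "adults receiving physiotherapy after surgery".split(),
]
patient = "adult patient with diabetes on metformin no hypertension".split()

weights = enrich_query(
    patient,
    entities={"diabetes", "metformin", "hypertension"},
    negated={"hypertension"},
)
scores = bm25_scores(weights, trials)
ranking = sorted(range(len(trials)), key=lambda i: scores[i], reverse=True)
```

In this sketch the boost only changes query-side weights. Boosting on the index side, as the abstract also describes, could be approximated by repeating entity tokens in the indexed text, but that is likewise an assumption rather than a description of the actual system.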