LeafAI: query generator for clinical cohort discovery rivaling a human programmer

Objective: Identifying study-eligible patients within clinical databases is a critical step in clinical research. However, accurate query design typically requires extensive technical and biomedical expertise. We sought to create a system capable of generating data model-agnostic queries while also providing novel logical reasoning capabilities for complex clinical trial eligibility criteria.

Materials and Methods: The task of query creation from eligibility criteria requires solving several text-processing problems, including named entity recognition and relation extraction, sequence-to-sequence transformation, normalization, and reasoning. We incorporated hybrid deep learning and rule-based modules for these tasks, as well as a knowledge base of the Unified Medical Language System (UMLS) and linked ontologies. To enable data model-agnostic query creation, we introduced a novel method for tagging database schema elements using UMLS concepts. To evaluate our system, called LeafAI, we compared its ability to identify patients who had been enrolled in 8 clinical trials conducted at our institution with that of a human database programmer. We measured performance by the number of actual enrolled patients matched by the generated queries.

Results: Across the 8 clinical trials, LeafAI's queries matched a mean of 43% of enrolled patients and identified 27,225 patients as eligible, compared with 27% matched and 14,587 eligible for queries written by the human database programmer. The human programmer spent 26 hours in total crafting queries, compared with several minutes for LeafAI.

Conclusions: Our work contributes a state-of-the-art data model-agnostic query generation system capable of conditional reasoning using a knowledge base. We demonstrate that LeafAI can rival an experienced human programmer in finding patients eligible for clinical trials.
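To make the schema-tagging idea concrete, the following is a minimal sketch of how tagging database columns with concept identifiers could let a single normalized criterion be compiled against different data models. It is an illustration under stated assumptions, not LeafAI's actual implementation: the names `TaggedColumn`, `build_query`, `SCHEMA_A`, and `SCHEMA_B`, the table and column names, and the tag `CONCEPT_DIAGNOSIS` (a placeholder standing in for a real UMLS concept identifier) are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class TaggedColumn:
    """One schema element annotated with a concept tag (hypothetical structure)."""
    table: str
    patient_column: str
    code_column: str
    concept: str  # placeholder tag; a real system would store a UMLS concept here


# Two made-up data models that store diagnoses under different names.
# Both tag their diagnosis column with the same placeholder concept.
SCHEMA_A = [TaggedColumn("diagnosis", "patient_id", "icd10_code", "CONCEPT_DIAGNOSIS")]
SCHEMA_B = [TaggedColumn("condition_occurrence", "person_id", "condition_code", "CONCEPT_DIAGNOSIS")]


def build_query(
    schema: Sequence[TaggedColumn], concept: str, codes: Sequence[str]
) -> Tuple[str, List[str]]:
    """Return a parameterized SQL string and its parameters for one criterion.

    The criterion is expressed only as a concept tag plus code values, so the
    same call works for any schema that carries a matching tag.
    """
    matches = [col for col in schema if col.concept == concept]
    if not matches:
        raise LookupError(f"no schema element is tagged with {concept!r}")
    col = matches[0]  # assumption: the first tagged element is the one to query
    placeholders = ", ".join("?" for _ in codes)
    sql = (
        f"SELECT DISTINCT {col.patient_column} FROM {col.table} "
        f"WHERE {col.code_column} IN ({placeholders})"
    )
    return sql, list(codes)


if __name__ == "__main__":
    # "E11" is used here only as an illustrative diagnosis code value.
    for name, schema in (("A", SCHEMA_A), ("B", SCHEMA_B)):
        sql, params = build_query(schema, "CONCEPT_DIAGNOSIS", ["E11"])
        print(name, sql, params)
```

In this sketch the eligibility criterion never mentions a table or column name; only the tags on the schema differ between the two data models. That separation is the property the abstract describes as data model-agnostic query creation.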
