Cohort discovery is a crucial step in clinical research on Electronic Health Record (EHR) data. Temporal queries, which are common in cohort discovery, can be time-consuming and prone to errors when processed on large EHR datasets. In this work, we introduce TELII, a temporal event level inverted indexing method designed for cohort discovery on large EHR datasets. TELII is engineered to pre-compute and store the relations along with the time difference between events, thereby providing fast and accurate temporal query capabilities. We implemented TELII for the OPTUM de-identified COVID-19 EHR dataset, which contains data from 8.87 million patients. We demonstrate four common temporal query tasks and their implementation using TELII with a MongoDB backend. Our results show that the temporal query speed for TELII is up to 2000 times faster than that of existing non-temporal inverted indexes. TELII achieves millisecond-level response times, enabling users to quickly explore event relations and find preliminary evidence for their research questions. Not only is TELII practical and straightforward to implement, but it also offers easy adaptability to other EHR datasets. These advantages underscore TELII's potential to serve as the query engine for EHR-based applications, ensuring fast, accurate, and user-friendly query responses.
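The core idea described above, pre-computing the relation and time difference between pairs of events so that a temporal query becomes a lookup, can be illustrated with a minimal in-memory sketch. This is not the paper's MongoDB implementation. The function names (`build_index`, `patients_with_b_after_a`), the record layout, the sample events, and the choice to keep only the first occurrence of each event per patient are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import date
from itertools import permutations


def build_index(records):
    """Build a pairwise temporal index from (patient_id, event_code, event_date) tuples.

    Hypothetical layout: (event_a, event_b) -> {patient_id: days from a to b}.
    """
    # Assumption: keep only the earliest date of each event per patient.
    first_seen = defaultdict(dict)
    for pid, code, day in records:
        prev = first_seen[pid].get(code)
        if prev is None or day < prev:
            first_seen[pid][code] = day

    # Pre-compute the signed time difference for every ordered event pair.
    index = defaultdict(dict)
    for pid, events in first_seen.items():
        for a, b in permutations(events, 2):
            index[(a, b)][pid] = (events[b] - events[a]).days
    return index


def patients_with_b_after_a(index, a, b, min_days=0, max_days=None):
    """Return patients whose event b falls min_days..max_days after event a."""
    hits = set()
    for pid, delta in index.get((a, b), {}).items():
        if delta >= min_days and (max_days is None or delta <= max_days):
            hits.add(pid)
    return hits


# Made-up sample data, for illustration only.
records = [
    ("p1", "COVID-19", date(2020, 3, 1)),
    ("p1", "anosmia", date(2020, 3, 11)),
    ("p2", "anosmia", date(2020, 1, 5)),
    ("p2", "COVID-19", date(2020, 4, 1)),
    ("p3", "COVID-19", date(2020, 5, 1)),
    ("p3", "anosmia", date(2020, 7, 15)),
]
index = build_index(records)
within_30_days = patients_with_b_after_a(index, "COVID-19", "anosmia", 0, 30)
any_time_after = patients_with_b_after_a(index, "COVID-19", "anosmia")
```

Because the time differences are stored at indexing time, the query only scans the entries for one event pair rather than each patient's full event history. The cost is a larger index, since the number of stored pairs grows quadratically with the number of distinct events per patient.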