Common Data Elements (CDEs) standardize data collection and sharing across studies, enhancing data interoperability and improving research reproducibility. However, implementing CDEs is challenging because of the broad range and variety of data elements. This study aims to develop an effective and efficient mapping tool to bridge the gap between local data elements and National Institutes of Health (NIH) CDEs. We propose CDEMapper, a large language model (LLM)-powered tool designed to assist in mapping local data elements to NIH CDEs. CDEMapper has three core modules: (1) CDE indexing and embeddings: NIH CDEs are indexed and embedded to support semantic search; (2) CDE recommendations: the tool combines Elasticsearch (BM25 similarity methods) with state-of-the-art GPT services to recommend candidate CDEs and their permissible values; and (3) human review: users review and select the NIH CDEs and values that best match their data elements and value sets. We evaluated the tool's recommendation accuracy against manually annotated mapping results. CDEMapper offers a publicly available, LLM-powered, and intuitive user interface that consolidates essential and advanced mapping services into a streamlined pipeline. It provides a step-by-step, quality-assured mapping workflow designed with a user-centered approach. The evaluation results demonstrate that augmenting BM25 with GPT embeddings and a ranker consistently improves CDEMapper's mapping accuracy in three different mapping settings across four evaluation datasets. This work demonstrates the potential of using LLMs to assist with CDE recommendation and human curation when aligning local data elements with NIH CDEs. Additionally, this effort enhances clinical research data interoperability and helps researchers better understand the gaps between local data elements and NIH CDEs.
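The recommendation module described above (lexical BM25 retrieval followed by embedding-based re-ranking) can be illustrated with a minimal, self-contained sketch. This is not the CDEMapper implementation: it assumes a pure-Python BM25 scorer in place of Elasticsearch and character-trigram counts as a stand-in for GPT embeddings, and the names `bm25_scores`, `toy_embed`, `recommend`, and `SAMPLE_CDES` are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (a simplifying assumption)."""
    return text.lower().split()


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document against the query with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    query_terms = set(tokenize(query))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def toy_embed(text, n=3):
    """Stand-in for an LLM embedding: sparse character n-gram counts."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(v * b[k] for k, v in a.items() if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recommend(query, cdes, top_k=3):
    """Retrieve top_k candidates by BM25, then re-rank them by similarity."""
    lexical = bm25_scores(query, cdes)
    candidates = sorted(range(len(cdes)), key=lambda i: lexical[i],
                        reverse=True)[:top_k]
    q_vec = toy_embed(query)
    reranked = sorted(candidates,
                      key=lambda i: cosine(q_vec, toy_embed(cdes[i])),
                      reverse=True)
    return [cdes[i] for i in reranked]


# Hypothetical CDE names, for illustration only (not taken from the NIH repository).
SAMPLE_CDES = [
    "Birth date",
    "Sex assigned at birth",
    "Body weight measurement",
    "Systolic blood pressure",
]

if __name__ == "__main__":
    for name in recommend("date of birth", SAMPLE_CDES):
        print(name)
```

In this sketch the lexical stage narrows the pool cheaply and the similarity stage reorders only the shortlisted candidates, which mirrors the retrieve-then-rank pattern the abstract describes; the final choice would still be left to a human reviewer.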