CDEMapper: Enhancing NIH Common Data Element Normalization using Large Language Models

Yan Wang, Jimin Huang, Huan He, Vincent Zhang, Yujia Zhou, Xubing Hao, Pritham Ram, Lingfei Qian, Qianqian Xie, Ruey-Ling Weng, Fongci Lin, Yan Hu, Licong Cui, Xiaoqian Jiang, Hua Xu, Na Hong

From arXiv, 11 pages, 4 figures

Common Data Elements (CDEs) standardize data collection and sharing across studies, enhancing data interoperability and improving research reproducibility. However, implementing CDEs is challenging because of the broad range and variety of data elements. This study aims to develop an effective and efficient mapping tool to bridge the gap between local data elements and National Institutes of Health (NIH) CDEs. We propose CDEMapper, a large language model (LLM) powered tool designed to assist in mapping local data elements to NIH CDEs. CDEMapper has three core modules: (1) CDE indexing and embedding: NIH CDEs are indexed and embedded to support semantic search; (2) CDE recommendation: the tool combines Elasticsearch (BM25 similarity) with state-of-the-art GPT services to recommend candidate CDEs and their permissible values; and (3) human review: users review and select the NIH CDEs and values that best match their data elements and value sets. We evaluate the tool's recommendation accuracy against manually annotated mapping results. CDEMapper offers a publicly available, LLM-powered, intuitive user interface that consolidates essential and advanced mapping services into a streamlined pipeline. It provides a step-by-step, quality-assured mapping workflow designed with a user-centered approach. The evaluation results show that augmenting BM25 with GPT embeddings and a ranker consistently improves CDEMapper's mapping accuracy in three different mapping settings across four evaluation datasets. This work demonstrates the potential of using LLMs to assist with CDE recommendation and human curation when aligning local data elements with NIH CDEs. It also enhances clinical research data interoperability and helps researchers better understand the gaps between local data elements and NIH CDEs.
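The recommendation step described in the abstract, lexical BM25 retrieval augmented with embedding-based semantic search and a ranker, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the authors' implementation: BM25 is computed in plain Python instead of through Elasticsearch, `toy_embed` is a hypothetical character-trigram stand-in for a GPT embedding call, reciprocal rank fusion is a swapped-in way to merge the two rankings (the abstract does not say how the ranker works), and the four CDE names are invented examples.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (illustrative only)."""
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    avgdl = sum(len(t) for t in tokenized) / n
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def toy_embed(text):
    """Hypothetical stand-in for a GPT embedding: character-trigram counts."""
    padded = f"  {text.lower()}  "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_positions(scores):
    """Map document index -> 1-based rank, highest score first."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    return {idx: rank for rank, idx in enumerate(order, start=1)}


def recommend(query, cdes, top_k=3, rrf_k=60):
    """Fuse lexical and semantic rankings with reciprocal rank fusion."""
    lexical = rank_positions(bm25_scores(query, cdes))
    q_vec = toy_embed(query)
    semantic = rank_positions([cosine(q_vec, toy_embed(c)) for c in cdes])
    fused = {
        i: 1 / (rrf_k + lexical[i]) + 1 / (rrf_k + semantic[i])
        for i in range(len(cdes))
    }
    best = sorted(fused, key=fused.get, reverse=True)[:top_k]
    return [cdes[i] for i in best]


# Invented CDE names, for illustration only.
cdes = [
    "Birth date",
    "Sex assigned at birth",
    "Body weight measurement",
    "Systolic blood pressure",
]
candidates = recommend("patient date of birth", cdes, top_k=2)
print(candidates)
```

In a workflow like the one the abstract describes, a candidate list of this kind would then go to the human review step, where the user confirms or rejects each suggested CDE.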

