A Natural Language Processing Approach to Support Biomedical Data Harmonization: Leveraging Large Language Models

Biomedical research requires large, diverse samples to produce unbiased results. Assembling such samples often means harmonizing data from multiple studies, and automated methods for matching variables across datasets can accelerate this process. Research in this area has been limited, focusing primarily on lexical matching and ontology-based semantic matching. We aimed to develop new methods, leveraging large language models (LLMs) and ensemble learning, to automate variable matching. Methods: We used data from two GERAS cohort studies (European [EU] and Japanese [JP]) to develop variable matching methods. We first manually created a dataset by matching 352 EU variables with 1322 candidate JP variables, where matched variable pairs were positive instances and unmatched pairs were negative instances. Using this dataset, we developed and evaluated two types of natural language processing (NLP) methods, which matched variables based on variable labels and definitions from data dictionaries: (1) LLM-based matching and (2) fuzzy matching. We then developed an ensemble-learning method, using a random forest (RF) model, to integrate the individual NLP methods. RF was trained and evaluated over 50 trials. Each trial used a random 4:1 split into training and test sets, with the model's hyperparameters optimized through cross-validation on the training set. For each EU variable, the 1322 candidate JP variables were ranked by NLP-derived similarity scores or by RF's probability scores, which denote their likelihood of matching the EU variable. Ranking performance was measured by top-n hit ratio (HR-n) and mean reciprocal rank (MRR). Results: E5 performed best among the individual methods, achieving an HR-30 of 0.90 and an MRR of 0.70. RF performed better than E5 on all metrics over the 50 trials (P < 0.001) and achieved an average HR-30 of 0.98 and MRR of 0.73. LLM-derived features contributed most to RF's performance. One major cause of errors in automatic variable matching was ambiguous variable definitions within data dictionaries.
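To make the ranking evaluation concrete, the following is a minimal Python sketch of how candidate variables might be ranked by a similarity score and then scored with HR-n and MRR. It is an illustration under stated assumptions, not the study's implementation: the function names, the toy variable labels, and the use of the standard library's `difflib.SequenceMatcher` as a stand-in for the fuzzy-matching and LLM-based similarity scores are all hypothetical, and each query variable is assumed to have exactly one correct match.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in lexical similarity in [0, 1]; NOT the scoring used in the study."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def rank_candidates(query_label: str, candidates: dict) -> list:
    """Return candidate ids sorted from most to least similar to the query label."""
    scored = [(similarity(query_label, label), cid) for cid, label in candidates.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [cid for _, cid in scored]


def hit_ratio_at_n(rankings: dict, gold: dict, n: int) -> float:
    """Fraction of queries whose correct match appears in the top n candidates."""
    hits = sum(1 for q, ranked in rankings.items() if gold[q] in ranked[:n])
    return hits / len(rankings)


def mean_reciprocal_rank(rankings: dict, gold: dict) -> float:
    """Average of 1/rank of the correct match; a query with no match found adds 0."""
    total = 0.0
    for q, ranked in rankings.items():
        if gold[q] in ranked:
            total += 1.0 / (ranked.index(gold[q]) + 1)
    return total / len(rankings)


if __name__ == "__main__":
    # Hypothetical data-dictionary labels, for illustration only.
    eu_vars = {"eu_age": "Age at baseline", "eu_wt": "Body weight in kg"}
    jp_vars = {
        "jp_1": "Age at baseline visit",
        "jp_2": "Body weight (kg)",
        "jp_3": "Caregiver relationship to patient",
    }
    gold = {"eu_age": "jp_1", "eu_wt": "jp_2"}  # assumed manual matches

    rankings = {q: rank_candidates(label, jp_vars) for q, label in eu_vars.items()}
    print("HR-1:", hit_ratio_at_n(rankings, gold, 1))
    print("MRR:", mean_reciprocal_rank(rankings, gold))
```

In an ensemble setting like the one described, the per-pair similarity scores from several such methods would serve as input features to a classifier, and the classifier's predicted match probability would replace `similarity` as the ranking key.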
