BioBridge: Unified Bio-Embedding with Bridging Modality in Code-Switched EMR

Pediatric Emergency Department (PED) overcrowding is a significant global challenge that calls for efficient solutions. This paper introduces the BioBridge framework, a novel approach that applies Natural Language Processing (NLP) to free-text Electronic Medical Records (EMRs) to enhance decision-making in the PED. In non-English-speaking countries such as South Korea, EMR data is often written in a Code-Switching (CS) format that mixes the native language with English, and most of the code-switched English words carry clinical significance. The BioBridge framework consists of two core modules: "bridging modality in context" and "unified bio-embedding." The "bridging modality in context" module improves the contextual understanding of bilingual and code-switched EMRs. In the "unified bio-embedding" module, knowledge from a model trained on the medical domain is injected into an encoder-based model to bridge the gap between the medical and general domains. Experimental results demonstrate that the proposed BioBridge significantly outperforms traditional machine learning and pre-trained encoder-based models on several metrics, including F1 score, area under the receiver operating characteristic curve (AUROC), area under the precision-recall curve (AUPRC), and Brier score. Specifically, compared with the baseline XLM model, BioBridge-XLM improved the F1 score by 0.85%, AUROC by 0.75%, and AUPRC by 0.76%, and reduced the Brier score by 3.04%, indicating gains in accuracy, reliability, and prediction calibration. The source code will be made publicly available.
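
To make the reported metrics concrete, the following is a minimal sketch of how the F1 score, AUROC, and Brier score can be computed from predicted probabilities. It assumes a binary outcome label, which the abstract does not state explicitly. The function names, the 0.5 decision threshold, and the toy data are illustrative assumptions, not the authors' evaluation code. AUPRC is omitted for brevity.

```python
from typing import Sequence


def brier_score(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """Mean squared gap between predicted probability and the 0/1 outcome.

    Lower is better; it reflects how well calibrated the probabilities are.
    """
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def f1_score(y_true: Sequence[int], y_prob: Sequence[float],
             threshold: float = 0.5) -> float:
    """Harmonic mean of precision and recall at an assumed threshold."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for y, yh in zip(y_true, y_pred) if y == 1 and yh == 1)
    fp = sum(1 for y, yh in zip(y_true, y_pred) if y == 0 and yh == 1)
    fn = sum(1 for y, yh in zip(y_true, y_pred) if y == 1 and yh == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def auroc(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """Probability that a random positive outranks a random negative.

    Ties count as half a win. Uses the O(n^2) pairwise definition,
    which is fine for a small illustration.
    """
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy labels and predicted probabilities (hypothetical, not from the paper).
labels = [1, 0, 1, 0]
probs = [0.9, 0.7, 0.4, 0.2]

print("F1:", f1_score(labels, probs))
print("AUROC:", auroc(labels, probs))
print("Brier:", brier_score(labels, probs))
```

F1 and AUROC reward correct classification and ranking, so higher is better, whereas the Brier score penalizes miscalibrated probabilities, so lower is better. This is why the abstract reports the Brier result as a decrease.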
