The scarcity of annotated datasets for clinical information extraction in non-English languages hinders the evaluation of large language model (LLM)-based methods developed primarily in English. In this study, we present the first comprehensive bilingual evaluation of LLMs for the clinical Relation Extraction (RE) task in both English and Turkish. To facilitate this evaluation, we introduce the first English-Turkish parallel clinical RE dataset, derived and carefully curated from the 2010 i2b2/VA relation classification corpus. We systematically assess a diverse set of prompting strategies, including multiple in-context learning (ICL) and Chain-of-Thought (CoT) approaches, and compare their performance to fine-tuned baselines such as PURE. Furthermore, we propose Relation-Aware Retrieval (RAR), a novel in-context example selection method based on contrastive learning that is specifically designed to capture both sentence-level and relation-level semantics. Our results show that prompting-based LLM approaches consistently outperform traditional fine-tuned models. Moreover, English results surpass their Turkish counterparts across all evaluated LLMs and prompting techniques. Among ICL methods, RAR achieves the highest performance, with Gemini 1.5 Flash reaching a micro-F1 score of 0.906 in English and 0.888 in Turkish. Performance further improves to 0.918 F1 in English when RAR is combined with a structured reasoning prompt using the DeepSeek-V3 model. These findings highlight the importance of high-quality demonstration retrieval and underscore the potential of advanced retrieval and prompting techniques to bridge resource gaps in clinical natural language processing.