We present E3C-3.0, a multilingual dataset in the medical domain comprising clinical cases annotated with diseases and test-result relations. The dataset includes both native texts in five languages (English, French, Italian, Spanish, and Basque) and texts translated and projected from the English source into five target languages (Greek, Italian, Polish, Slovak, and Slovenian). We adopt a semi-automatic approach that combines automatic annotation projection based on Large Language Models (LLMs) with human revision. We present several experiments showing that current state-of-the-art LLMs benefit from fine-tuning on the E3C-3.0 dataset. We also show that cross-lingual transfer learning is very effective and mitigates data scarcity. Finally, we compare performance on native data and on projected data. We release the data at https://huggingface.co/collections/NLP-FBK/e3c-projected-676a7d6221608d60e4e9fd89 .