We investigate machine translation of clinical text using multilingual neural network models built on deep learning architectures such as the Transformer. To address the imbalance in language resources, we also run experiments with a transfer learning methodology based on massive multilingual pre-trained language models (MMPLMs). Experimental results on three subtasks, 1) clinical case (CC), 2) clinical terminology (CT), and 3) ontological concept (OC), show that our models achieved top-level performance in the ClinSpEn-2022 shared task on English-Spanish clinical domain data. In addition, our expert-based human evaluations show that the small-sized pre-trained language model (PLM) outperformed the other two extra-large language models by a large margin in clinical domain fine-tuning, a finding that has not previously been reported in the field. Finally, the transfer learning method works well in our experimental setting: the WMT21fb model accommodates a new language space, Spanish, that was not seen during WMT21fb's own pre-training stage. This deserves further exploration for clinical knowledge transformation, e.g. by extending the investigation to more languages. These findings can shed light on the development of domain-specific machine translation, especially in the clinical and healthcare fields. Further research projects can build on our work to improve healthcare text analytics and knowledge transformation. Our data will be openly available for research purposes at https://github.com/HECTA-UoM/ClinicalNMT