Biomedical data and benchmarks are highly valuable yet scarce in low-resource languages such as Vietnamese. In this paper, we use a state-of-the-art English-Vietnamese translation model to translate and produce both pretraining and supervised data in the biomedical domain. Building on this large-scale translation, we introduce ViPubmedT5, a pretrained encoder-decoder Transformer model trained on 20 million translated abstracts from the high-quality public PubMed corpus. ViPubmedT5 achieves state-of-the-art results on two biomedical benchmarks, one in summarization and one in acronym disambiguation. Further, we release ViMedNLI, a new NLP task in Vietnamese translated from MedNLI using the recently released En-Vi translation model and carefully refined by human experts, and we evaluate existing methods against ViPubmedT5 on it.