In this paper, we release the largest medical Question Answering (QA) dataset to date, with 26 million QA pairs. We benchmark many existing approaches on our dataset in terms of both retrieval and generation. Experimental results show that existing models perform far below expectations and that the released dataset remains challenging in the era of pre-trained language models. Moreover, we experimentally show the benefit of the proposed dataset in three respects: (i) training models that are applied to other QA datasets in a zero-shot fashion; (ii) serving as external knowledge for retrieval-augmented generation (RAG); and (iii) improving existing pre-trained language models by using the QA pairs as a pre-training corpus for continued training. We believe that this dataset will not only contribute to medical research but also benefit both patients and clinical doctors. See \url{https://github.com/FreedomIntelligence/Huatuo-26M}.