We study extractive question-answering in the medical domain (Medical-EQA). This problem poses two main challenges: (i) domain specificity, as most AI models lack the necessary domain knowledge, and (ii) the extraction-based answering style, which restricts most autoregressive LLMs due to potential hallucinations. To address these challenges, we propose TOP-Training, a target-oriented pre-training paradigm that stands out among domain-adaptation techniques with two desirable features: (i) TOP-Training goes one step further than popular domain-oriented fine-tuning, since it not only moves closer to the target domain but also familiarizes the model with the target dataset, and (ii) it does not assume the existence of a large set of unlabeled instances from the target domain. Specifically, for a target Medical-EQA dataset, we extract its entities and leverage large language models (LLMs) to generate synthetic texts containing those entities; we then demonstrate that pre-training on this synthetic text yields better performance on the target Medical-EQA benchmarks. Overall, our contributions are threefold: (i) TOP-Training, a new pre-training technique that effectively adapts LLMs to better solve a target problem, (ii) a wide application scope, since TOP-Training does not require the target problem to have a large set of unlabeled data, and (iii) experiments that highlight the limitations of autoregressive LLMs and emphasize TOP-Training as a means to unlock the true potential of bidirectional LLMs.
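The abstract describes a three-step pipeline: extract entities from the target dataset, have an LLM generate synthetic passages around those entities, and continue pre-training a bidirectional model on that synthetic corpus. The following is a minimal sketch of that pipeline under several assumptions not taken from the paper: spaCy's general-purpose "en_core_web_sm" model stands in for entity extraction, an OpenAI-style chat API stands in for the generator, and Hugging Face transformers handles masked-LM pre-training; the model names and prompt are placeholders, not the authors' actual setup.

```python
# Hypothetical sketch of a TOP-Training-style pipeline.
# Assumptions (not from the paper): spaCy for NER, an OpenAI-style
# chat endpoint for synthetic text, and masked-LM pre-training of a
# bidirectional model with Hugging Face `transformers`.

import spacy
from openai import OpenAI
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

# Step 1: extract entities from the target Medical-EQA dataset's contexts.
nlp = spacy.load("en_core_web_sm")  # a biomedical NER model would fit better
def extract_entities(contexts):
    ents = set()
    for doc in nlp.pipe(contexts):
        ents.update(e.text for e in doc.ents)
    return sorted(ents)

# Step 2: ask an LLM to write synthetic passages containing those entities.
client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
def generate_passage(entity):
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user",
                   "content": f"Write a short medical paragraph about {entity}."}],
    )
    return resp.choices[0].message.content

# Step 3: continue masked-LM pre-training of a bidirectional LM on the
# synthetic corpus, before fine-tuning it on the target EQA dataset.
def pretrain_on_synthetic(passages, base="bert-base-uncased"):
    tok = AutoTokenizer.from_pretrained(base)
    model = AutoModelForMaskedLM.from_pretrained(base)
    ds = Dataset.from_dict({"text": passages}).map(
        lambda batch: tok(batch["text"], truncation=True, max_length=256),
        batched=True, remove_columns=["text"])
    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="top_training_ckpt",
                               num_train_epochs=1,
                               per_device_train_batch_size=8),
        train_dataset=ds,
        data_collator=DataCollatorForLanguageModeling(tok, mlm_probability=0.15),
    )
    trainer.train()
    return model

if __name__ == "__main__":
    contexts = ["Metformin is a first-line treatment for type 2 diabetes."]
    entities = extract_entities(contexts)
    passages = [generate_passage(e) for e in entities]
    pretrain_on_synthetic(passages)
```

A bidirectional base model is used in the sketch because the abstract argues that extraction-style answering favors bidirectional LLMs over autoregressive ones; the same pre-train-then-fine-tune recipe would apply to any target EQA dataset.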