Recent research on large language models (LLMs) has focused primarily on adapting and applying them to specialized domains. In medicine, LLM applications concentrate on tasks such as automated medical report generation, summarization, diagnostic reasoning, and question answering between doctors and patients. Becoming a good teacher is harder than becoming a good student, and this study pioneers the application of LLMs to medical education. We investigate the extent to which LLMs can generate medical qualification exam questions and corresponding answers from few-shot prompts. Using a real-world Chinese dataset of elderly chronic diseases, we tasked eight widely used LLMs (ERNIE 4, ChatGLM 4, Doubao, Hunyuan, Spark 4, Qwen, Llama 3, and Mistral) with generating open-ended questions and answers based on a sampled subset of admission reports. We then engaged medical experts to manually evaluate these questions and answers along multiple dimensions. We found that, given few-shot prompts, LLMs can effectively mimic real-world medical qualification exam questions, but the generated answers leave room for improvement in correctness, evidence-based statements, and professionalism. LLMs also show a decent ability to correct reference answers. Given the immense potential of artificial intelligence in medicine, generating qualification exam questions and answers for medical students, interns, and residents could be a significant focus of future research.