No previous work has studied the performance of Large Language Models (LLMs) in the context of Traditional Chinese Medicine (TCM), an essential and distinct branch of medical knowledge with a rich history. To bridge this gap, we present a TCM question dataset named TCM-QA, which comprises three question types (single-choice, multiple-choice, and true-or-false) and is designed to examine an LLM's capacity for knowledge recall and comprehensive reasoning within the TCM domain. We evaluate the LLM in zero-shot and few-shot settings and compare English and Chinese prompts. Our results indicate that ChatGPT performs best on true-or-false questions, with a highest precision of 0.688, and worst on multiple-choice questions, with a lowest precision of 0.241. Furthermore, Chinese prompts outperformed English prompts in our evaluations. We also assess the quality of the explanations generated by ChatGPT and their potential contribution to TCM knowledge comprehension. This paper offers insights into the applicability of LLMs in specialized domains and paves the way for future research on leveraging these models to advance TCM.