Large Language Models (LLMs) have demonstrated remarkable proficiency in general medical domains. However, their performance degrades significantly in specialized, culturally specific domains such as Vietnamese Traditional Medicine (VTM), primarily because high-quality, structured benchmarks are scarce. In this paper, we introduce VietMed-MCQ, a novel multiple-choice question dataset generated via a Retrieval-Augmented Generation (RAG) pipeline with an automated consistency check. Unlike previous synthetic datasets, our framework incorporates dual-model validation, which enforces reasoning consistency through independent answer verification, although its substring-based evidence check has known limitations. The complete dataset of 3,190 questions spans three difficulty levels and was validated by one medical expert and four students, achieving 94.2% approval with substantial inter-rater agreement (Fleiss' kappa = 0.82). We benchmark seven open-source models on VietMed-MCQ. The results show that general-purpose models with strong Chinese priors outperform Vietnamese-centric models, which points to cross-lingual conceptual transfer, while all models still struggle with complex diagnostic reasoning. Our code and dataset are publicly available to foster research in low-resource medical domains.
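For readers unfamiliar with the agreement statistic reported above, the following is a minimal sketch of how Fleiss' kappa is computed from per-item rating counts. The function name `fleiss_kappa` and the toy ratings are illustrative assumptions. They are not the authors' code or the VietMed-MCQ annotation data, and the abstract does not state how the reported value of 0.82 was computed.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa for a table of rating counts.

    counts[i][j] is the number of raters who assigned item i to category j.
    Every item must be rated by the same number of raters (at least two).
    """
    n_items = len(counts)
    if n_items == 0:
        raise ValueError("at least one item is required")
    n_raters = sum(counts[0])
    if n_raters < 2:
        raise ValueError("at least two raters per item are required")
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item must have the same number of ratings")
    n_categories = len(counts[0])

    # Per-item agreement: fraction of rater pairs that agree on the item.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    observed = sum(per_item) / n_items

    # Chance agreement from the overall share of ratings in each category.
    shares = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    expected = sum(p * p for p in shares)
    if expected == 1.0:
        raise ValueError("kappa is undefined when only one category is ever used")

    return (observed - expected) / (1.0 - expected)


# Hypothetical toy data: 4 questions, 5 raters, categories (approve, reject).
toy_counts = [[5, 0], [4, 1], [0, 5], [1, 4]]
kappa = fleiss_kappa(toy_counts)
```

With five raters per item, as in the validation setup described above, each row of the table sums to 5. The statistic compares observed pairwise agreement with the agreement expected by chance given the overall category frequencies.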