VietMed-MCQ: A Consistency-Filtered Data Synthesis Framework for Vietnamese Traditional Medicine Evaluation

Note from arXiv: The authors have withdrawn this article because the current version is still undergoing substantial revision. Several components of the data synthesis framework, consistency-filtering procedure, evaluation protocol, and experimental analysis are being refined and expanded. As a result, the current manuscript should not be considered a complete or final representation of the work.

Large Language Models (LLMs) have demonstrated remarkable proficiency in general medical domains. However, their performance significantly degrades in specialized, culturally specific domains such as Vietnamese Traditional Medicine (VTM), primarily due to the scarcity of high-quality, structured benchmarks. In this paper, we introduce VietMed-MCQ, a novel multiple-choice question dataset generated via a Retrieval-Augmented Generation (RAG) pipeline with an automated consistency check mechanism. Unlike previous synthetic datasets, our framework incorporates a dual-model validation approach to ensure reasoning consistency through independent answer verification, though the substring-based evidence checking has known limitations. The complete dataset of 3,190 questions spans three difficulty levels and underwent validation by one medical expert and four students, achieving 94.2 percent approval with substantial inter-rater agreement (Fleiss' kappa = 0.82). We benchmark seven open-source models on VietMed-MCQ. Results reveal that general-purpose models with strong Chinese priors outperform Vietnamese-centric models, highlighting cross-lingual conceptual transfer, while all models still struggle with complex diagnostic reasoning. Our code and dataset are publicly available to foster research in low-resource medical domains.
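The abstract does not spell out the filtering procedure, so the following is only a minimal sketch of what a dual-model consistency filter with substring-based evidence checking could look like, not the authors' implementation. The `Candidate` fields, the `verifier` callable, and the choice to show the verifier the retrieved passage are all assumptions made for illustration. A candidate question is kept only when an independent verifier picks the same answer as the generator and the cited evidence appears verbatim (after whitespace and case normalization) in the retrieved context.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Candidate:
    """One generated question. All field names are hypothetical."""
    question: str
    options: dict[str, str]   # option label -> option text
    answer: str               # label proposed by the generator model
    evidence: str             # span the generator cites as support
    context: str              # retrieved passage the question was built from


# Hypothetical verifier signature: (question, options, context) -> answer label.
Verifier = Callable[[str, dict[str, str], str], str]


def _normalize(text: str) -> str:
    # Collapse whitespace and case so trivial formatting differences
    # do not cause the substring test to fail.
    return " ".join(text.lower().split())


def evidence_in_context(c: Candidate) -> bool:
    """Substring-based evidence check: the cited span must occur in the context."""
    ev = _normalize(c.evidence)
    return bool(ev) and ev in _normalize(c.context)


def consistency_filter(candidates: Iterable[Candidate],
                       verifier: Verifier) -> list[Candidate]:
    """Keep candidates whose answer is reproduced by the verifier
    and whose evidence is found in the retrieved context."""
    kept = []
    for c in candidates:
        predicted = verifier(c.question, c.options, c.context)
        if predicted == c.answer and evidence_in_context(c):
            kept.append(c)
    return kept
```

The sketch also makes the limitation mentioned in the abstract concrete: evidence that paraphrases the passage rather than quoting it fails the substring test, so a valid question can be discarded, while a verbatim quote that does not actually support the answer still passes.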

