Large language models (LLMs) are advancing rapidly in medical NLP, yet Traditional Chinese Medicine (TCM), with its distinctive ontology, terminology, and reasoning patterns, requires domain-faithful evaluation. Existing TCM benchmarks are fragmented in coverage and scale, and they rely on non-unified or generation-heavy scoring that hinders fair comparison. We present the LingLanMiDian (LingLan) benchmark, a large-scale, expert-curated, multi-task suite that unifies evaluation across knowledge recall, multi-hop reasoning, information extraction, and real-world clinical decision-making. LingLan introduces a consistent metric design, a synonym-tolerant protocol for clinical labels, a per-dataset 400-item Hard subset, and a reframing of diagnosis and treatment recommendation as single-choice decision recognition. We conduct comprehensive zero-shot evaluations of 14 leading open-source and proprietary LLMs, providing a unified perspective on their strengths and limitations in TCM commonsense knowledge understanding, reasoning, and clinical decision support; critically, evaluation on the Hard subset reveals a substantial gap between current models and human experts in TCM-specialized reasoning. By bridging fundamental knowledge and applied reasoning through standardized evaluation, LingLan establishes a unified, quantitative, and extensible foundation for advancing TCM LLMs and domain-specific medical AI research. All evaluation data and code are available at https://github.com/TCMAI-BJTU/LingLan and http://tcmnlp.com.