Large Language Models (LLMs) offer the potential for major breakthroughs in medicine. Establishing a standardized medical benchmark is therefore a cornerstone for measuring progress. However, medical environments differ across regions and have local characteristics, e.g., the ubiquity and significance of traditional Chinese medicine within China. Consequently, merely translating English-based medical evaluations may introduce \textit{contextual incongruities} in a local region. To address this issue, we propose a localized medical benchmark called CMB, a Comprehensive Medical Benchmark in Chinese, designed and rooted entirely within the native Chinese linguistic and cultural framework. While traditional Chinese medicine is integral to this evaluation, it does not constitute its entirety. Using this benchmark, we evaluate several prominent LLMs, including ChatGPT, GPT-4, dedicated Chinese LLMs, and LLMs specialized in the medical domain. We hope this benchmark provides first-hand experience with existing LLMs for medicine and facilitates the widespread adoption and enhancement of medical LLMs within China. Our data and code are publicly available at https://github.com/FreedomIntelligence/CMB.