Large Language Models (LLMs) offer the potential for major breakthroughs in medicine. A standardized medical benchmark is therefore a cornerstone for measuring progress. However, medical environments differ across regions and have local characteristics, e.g., the ubiquity and significance of traditional Chinese medicine within China. Merely translating English-based medical evaluations may therefore introduce \textit{contextual incongruities} in a local region. To address this issue, we propose a localized medical benchmark called CMB, a Comprehensive Medical Benchmark in Chinese, designed and rooted entirely within the native Chinese linguistic and cultural framework. While traditional Chinese medicine is integral to this evaluation, it does not constitute its entirety. Using this benchmark, we evaluate several prominent LLMs, including ChatGPT, GPT-4, dedicated Chinese LLMs, and LLMs specialized in the medical domain. Notably, our benchmark is not devised as a leaderboard competition but as an instrument for self-assessment of model advancements. We hope this benchmark will facilitate the widespread adoption and enhancement of medical LLMs within China. Details are available at \url{https://cmedbenchmark.llmzoo.com/}.