ClinConsensus: A Consensus-Based Benchmark for Evaluating Chinese Medical LLMs across Difficulty Levels

Large language models (LLMs) are increasingly applied to health management, showing promise across disease prevention, clinical decision-making, and long-term care. However, existing medical benchmarks remain largely static and task-isolated, failing to capture the openness, longitudinal structure, and safety-critical complexity of real-world clinical workflows. We introduce ClinConsensus, a Chinese medical benchmark curated, validated, and quality-controlled by clinical experts. ClinConsensus comprises 2500 open-ended cases spanning the full continuum of care--from prevention and intervention to long-term follow-up--covering 36 medical specialties, 12 common clinical task types, and progressively increasing levels of complexity. To enable reliable evaluation of such complex scenarios, we adopt a rubric-based grading protocol and propose the Clinically Applicable Consistency Score (CACS@k). We further introduce a dual-judge evaluation framework, combining a high-capability LLM-as-judge with a distilled, locally deployable judge model trained via supervised fine-tuning, enabling scalable and reproducible evaluation aligned with physician judgment. Using ClinConsensus, we conduct a comprehensive assessment of several leading LLMs and reveal substantial heterogeneity across task themes, care stages, and medical specialties. While top-performing models achieve comparable overall scores, they differ markedly in reasoning, evidence use, and longitudinal follow-up capabilities, and clinically actionable treatment planning remains a key bottleneck. We release ClinConsensus as an extensible benchmark to support the development and evaluation of medical LLMs that are robust, clinically grounded, and ready for real-world deployment.
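
The abstract names a rubric-based grading protocol but does not spell out the rubric format or the definition of CACS@k. The following is therefore only a minimal sketch of how one open-ended response might be scored against a rubric, assuming each rubric item carries a point value and a binary met/unmet verdict from a judge. The names `RubricItem` and `rubric_score`, the point values, and the normalization are all illustrative assumptions, not the ClinConsensus protocol, and the sketch does not attempt to reproduce CACS@k.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RubricItem:
    """One hypothetical rubric criterion for a single case (assumed schema)."""
    criterion: str
    points: float  # assumed weight; a negative value penalizes unsafe content
    met: bool      # the judge's verdict: does the response satisfy the criterion?


def rubric_score(items: list[RubricItem]) -> float:
    """Return earned points over maximum attainable points, clipped to [0, 1].

    Assumption: only positive-point items count toward the maximum, so a
    penalty can lower the score but never changes the ceiling.
    """
    max_points = sum(item.points for item in items if item.points > 0)
    if max_points == 0:
        raise ValueError("rubric needs at least one positive-point item")
    earned = sum(item.points for item in items if item.met)
    return min(1.0, max(0.0, earned / max_points))


# Made-up rubric for one case; the criteria are placeholders, not benchmark data.
example = [
    RubricItem("Recommends confirming the diagnosis before treatment", 4.0, True),
    RubricItem("Gives a concrete follow-up interval", 3.0, False),
    RubricItem("Suggests a contraindicated drug", -5.0, False),
]
score = rubric_score(example)  # 4 earned out of 7 attainable points
```

Clipping to [0, 1] keeps a heavily penalized response from producing a negative score. Whether the benchmark normalizes or penalizes in this way is not stated in the abstract.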
