We present Clinical Camel, an open large language model (LLM) explicitly tailored for clinical research. Fine-tuned from LLaMA-2 using QLoRA, Clinical Camel achieves state-of-the-art performance across medical benchmarks among openly available medical LLMs. Leveraging efficient single-GPU training, Clinical Camel surpasses GPT-3.5 in five-shot evaluations on all assessed benchmarks, including 64.3% on the USMLE Sample Exam (compared to 58.5% for GPT-3.5), 77.9% on PubMedQA (compared to 60.2%), 60.7% on MedQA (compared to 53.6%), and 54.2% on MedMCQA (compared to 51.0%). In addition to these benchmarks, Clinical Camel demonstrates its broader capabilities, such as synthesizing plausible clinical notes. This work introduces dialogue-based knowledge encoding, a novel method to synthesize conversational data from dense medical texts. While benchmark results are encouraging, extensive and rigorous human evaluation across diverse clinical scenarios is imperative to ascertain safety before implementation. By openly sharing Clinical Camel, we hope to foster transparent and collaborative research, working towards the safe integration of LLMs within the healthcare domain. Significant challenges concerning reliability, bias, and the potential for outdated knowledge persist. Nonetheless, the transparency provided by an open approach reinforces the scientific rigor essential for future clinical applications.
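The reported five-shot scores can be put side by side to make the margins over GPT-3.5 explicit. The snippet below is only an illustrative sketch: the accuracy figures are copied from the abstract, while the dictionary layout and the names `scores` and `margins` are assumptions introduced here, not part of the authors' evaluation code.

```python
# Five-shot accuracies (%) as reported in the abstract:
# (Clinical Camel, GPT-3.5). The data structure itself is hypothetical.
scores = {
    "USMLE Sample Exam": (64.3, 58.5),
    "PubMedQA": (77.9, 60.2),
    "MedQA": (60.7, 53.6),
    "MedMCQA": (54.2, 51.0),
}

# Absolute margin in percentage points, rounded to one decimal place
# to avoid floating-point noise in the subtraction.
margins = {
    name: round(camel - gpt35, 1)
    for name, (camel, gpt35) in scores.items()
}

# Benchmarks with the largest and smallest reported gain.
largest = max(margins, key=margins.get)
smallest = min(margins, key=margins.get)

for name, delta in margins.items():
    print(f"{name}: +{delta} points")
print(f"largest gain: {largest}; smallest gain: {smallest}")
```

On these figures the gap ranges from 3.2 points on MedMCQA to 17.7 points on PubMedQA, so the advantage over GPT-3.5 is uneven across benchmarks rather than uniform.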