Large language models (LLMs) are revolutionizing NLP tasks. However, the most powerful LLMs, such as GPT-4, are too costly for most domain-specific scenarios. We present the first continuously trained 13B Llama2-based LLM that is purpose-built for medical conversations and evaluated on automated scribing. Our results show that our model outperforms GPT-4 on PubMedQA with 76.6\% accuracy and matches its performance in summarizing medical conversations into SOAP notes. Notably, our model captures more correct medical concepts than GPT-4 and outperforms human scribes in both correctness and completeness.