A comprehensive qualitative evaluation framework for large language models (LLMs) in healthcare that expands beyond traditional accuracy and quantitative metrics is needed. We propose five key aspects for the evaluation of LLMs: Safety, Consensus, Objectivity, Reproducibility, and Explainability (S.C.O.R.E.). We suggest that S.C.O.R.E. may form the basis for an evaluation framework for future LLM-based models that are safe, reliable, trustworthy, and ethical for healthcare and clinical applications.