Foundation Metrics: Quantifying Effectiveness of Healthcare Conversations powered by Generative AI

Mahyar Abbasian, Elahe Khatibi, Iman Azimi, David Oniani, Zahra Shakeri Hossein Abad, Alexander Thieme, Zhongqi Yang, Yanshan Wang, Bryant Lin, Olivier Gevaert, Li-Jia Li, Ramesh Jain, Amir M. Rahmani

from arXiv, 13 pages, 4 figures, 2 tables, journal paper

Generative Artificial Intelligence is set to revolutionize healthcare delivery by transforming traditional patient care into a more personalized, efficient, and proactive process. Chatbots, serving as interactive conversational models, will probably drive this patient-centered transformation in healthcare. Through the provision of various services, including diagnosis, personalized lifestyle recommendations, and mental health support, the objective is to substantially improve patient health outcomes while reducing the workload of healthcare providers. The life-critical nature of healthcare applications necessitates establishing a unified and comprehensive set of evaluation metrics for conversational models. Existing evaluation metrics proposed for various generic large language models (LLMs) demonstrate a lack of comprehension regarding medical and health concepts and their significance in promoting patients' well-being. Moreover, these metrics neglect pivotal user-centered aspects, including trust-building, ethics, personalization, empathy, user comprehension, and emotional support. The purpose of this paper is to explore state-of-the-art LLM-based evaluation metrics that are specifically applicable to the assessment of interactive conversational models in healthcare. Subsequently, we present a comprehensive set of evaluation metrics designed to thoroughly assess the performance of healthcare chatbots from an end-user perspective. These metrics encompass an evaluation of language processing abilities, impact on real-world clinical tasks, and effectiveness in user-interactive conversations. Finally, we discuss the challenges associated with defining and implementing these metrics, with particular emphasis on confounding factors such as the target audience, evaluation methods, and prompt techniques involved in the evaluation process.
