In digital healthcare, large language models (LLMs) have primarily been used to enhance question-answering capabilities and improve patient interactions. However, effective patient care requires LLM chains that can actively gather information by posing relevant questions. This paper presents HealthQ, a novel framework for evaluating the questioning capabilities of LLM healthcare chains. We implemented several LLM chains, including Retrieval-Augmented Generation (RAG), Chain of Thought (CoT), and reflective chains, and introduced an LLM judge to assess the relevance and informativeness of the generated questions. To validate HealthQ, we employed traditional Natural Language Processing (NLP) metrics, such as Recall-Oriented Understudy for Gisting Evaluation (ROUGE) and Named Entity Recognition (NER)-based set comparison, and constructed two custom datasets from the public medical note datasets ChatDoctor and MTS-Dialog. Our contributions are threefold: we provide the first comprehensive study of the questioning capabilities of LLMs in healthcare conversations, develop a novel dataset-generation pipeline, and propose a detailed evaluation methodology.
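The abstract does not spell out how the ROUGE and NER-based validation is computed. As a minimal, non-authoritative sketch of what such metrics could look like, the snippet below compares a ground-truth patient note against the information elicited by a chain's questions, assuming spaCy for NER and the `rouge_score` package; the helper names `ner_set_overlap` and `rouge_overlap` are hypothetical, not from the paper.

```python
# Sketch (not the authors' code) of the two validation metrics named in the
# abstract: NER-based set comparison and ROUGE overlap. Requires spaCy with
# the en_core_web_sm model and the rouge_score package installed.
import spacy
from rouge_score import rouge_scorer

nlp = spacy.load("en_core_web_sm")  # general-purpose NER; a clinical model may fit better
_rouge = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)


def ner_set_overlap(reference_note: str, elicited_text: str) -> dict:
    """Precision/recall/F1 over the sets of named entities found in the
    ground-truth note versus the text elicited by the chain's questions."""
    ref = {e.text.lower() for e in nlp(reference_note).ents}
    hyp = {e.text.lower() for e in nlp(elicited_text).ents}
    if not ref or not hyp:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    tp = len(ref & hyp)
    p, r = tp / len(hyp), tp / len(ref)
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return {"precision": p, "recall": r, "f1": f1}


def rouge_overlap(reference_note: str, elicited_text: str) -> dict:
    """ROUGE-1 and ROUGE-L F-measures between note and elicited text."""
    scores = _rouge.score(reference_note, elicited_text)
    return {k: v.fmeasure for k, v in scores.items()}
```

Under this reading, a chain whose questions recover more of the note's clinical entities scores higher on recall, complementing the LLM judge's qualitative assessment of question relevance and informativeness.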