Large language models (LLMs) have made significant progress in various domains, including healthcare. However, the specialized nature of clinical language understanding tasks presents unique challenges and limitations that warrant further investigation. In this study, we conduct a comprehensive evaluation of state-of-the-art LLMs, namely GPT-3.5, GPT-4, and Bard, on clinical language understanding tasks. These tasks span a diverse range, including named entity recognition, relation extraction, natural language inference, semantic textual similarity, document classification, and question answering. We also introduce a novel prompting strategy, self-questioning prompting (SQP), tailored to enhance LLMs' performance by eliciting informative questions and answers pertinent to the clinical scenarios at hand. Our evaluation underscores the significance of task-specific learning strategies and prompting techniques for improving LLMs' effectiveness in healthcare-related tasks. Additionally, our in-depth error analysis on the challenging relation extraction task offers valuable insights into error distribution and potential avenues for improvement using SQP. Our study sheds light on the practical implications of employing LLMs in the specialized domain of healthcare, serving as a foundation for future research and the development of potential applications in healthcare settings.