Large language models (LLMs) have made significant progress in various domains, including healthcare. However, the specialized nature of clinical language understanding tasks presents unique challenges and limitations that warrant further investigation. In this study, we conduct a comprehensive evaluation of three state-of-the-art LLMs, GPT-3.5, GPT-4, and Bard, on clinical language understanding tasks. These tasks span named entity recognition, relation extraction, natural language inference, semantic textual similarity, document classification, and question answering. We also introduce a novel prompting strategy, self-questioning prompting (SQP), designed to improve LLM performance by eliciting informative questions and answers pertinent to the clinical scenario at hand. Our evaluation underscores the importance of task-specific learning strategies and prompting techniques for improving LLMs' effectiveness in healthcare-related tasks. Additionally, our in-depth error analysis of the challenging relation extraction task offers insights into the distribution of errors and potential avenues for improvement using SQP. Our study sheds light on the practical implications of employing LLMs in the specialized domain of healthcare, serving as a foundation for future research and the development of potential applications in healthcare settings.