This study examines the effect of prompt engineering on the performance of Large Language Models (LLMs) in clinical note generation. We introduce an Automatic Prompt Optimization (APO) framework to refine initial prompts and compare the outputs of medical experts, non-medical experts, and APO-enhanced GPT-3.5 and GPT-4. Results highlight the superior performance of GPT-4 with APO in standardizing prompt quality across clinical note sections. A human-in-the-loop evaluation shows that experts maintain content quality after APO while preferring their own modifications, suggesting the value of expert customization. We therefore recommend a two-phase optimization process: applying APO with GPT-4 for consistency, followed by expert input for personalization.