This study examines how prompt engineering affects the performance of Large Language Models (LLMs) in clinical note generation. We introduce an Automatic Prompt Optimization (APO) framework that refines initial prompts, and we compare the outputs of medical experts, non-medical experts, and APO-enhanced GPT3.5 and GPT4. The results show that APO with GPT4 performs best at standardizing prompt quality across clinical note sections. In a human-in-the-loop evaluation, experts maintain content quality after APO but prefer their own modifications, which suggests that expert customization adds value. We recommend a two-phase optimization process that uses APO-GPT4 for consistency and expert input for personalization.
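The recommended two-phase process can be illustrated with a minimal sketch. The abstract does not specify how APO works internally, so the greedy accept-if-better loop below is an assumption, and every name in it (`auto_optimize`, `two_phase_optimize`, the `rewrite`, `score`, and `expert_edit` callables) is hypothetical. The callables stand in for an LLM rewriting call, a prompt-quality metric, and a clinician's manual revision, respectively.

```python
from typing import Callable

# Hypothetical types: none of these names come from the study.
Rewriter = Callable[[str], str]    # stand-in for an LLM call that proposes a revised prompt
Scorer = Callable[[str], float]    # stand-in for a quality metric on the prompt
ExpertEdit = Callable[[str], str]  # stand-in for an expert's manual revision


def auto_optimize(prompt: str, rewrite: Rewriter, score: Scorer, rounds: int = 3) -> str:
    """Phase 1 (assumed greedy loop): keep a rewrite only if it scores strictly higher."""
    best, best_score = prompt, score(prompt)
    for _ in range(rounds):
        candidate = rewrite(best)
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best


def two_phase_optimize(
    prompt: str,
    rewrite: Rewriter,
    score: Scorer,
    expert_edit: ExpertEdit,
    rounds: int = 3,
) -> str:
    """Phase 1 standardizes the prompt automatically; Phase 2 applies the expert's edit."""
    standardized = auto_optimize(prompt, rewrite, score, rounds)
    return expert_edit(standardized)


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs without any API access.
    def toy_rewrite(p: str) -> str:
        return p + " Use concise clinical language."

    def toy_score(p: str) -> float:
        return float(len(set(p.split())))  # crude proxy: number of distinct tokens

    def toy_expert(p: str) -> str:
        return p + " Follow my clinic's abbreviation list."

    print(two_phase_optimize("Write the Assessment section.", toy_rewrite, toy_score, toy_expert))
```

In practice the toy callables would be replaced by a real model call and a real evaluation metric. Keeping the expert edit as a separate final step mirrors the abstract's split between automated consistency and expert personalization.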