Instruction-tuned Large Language Models (LLMs) can perform a wide range of tasks when given natural language instructions, but they are sensitive to how such instructions are phrased. This issue is especially concerning in healthcare, as clinicians are unlikely to be experienced prompt engineers and the potential consequences of inaccurate outputs are heightened in this domain. This raises a practical question: How robust are instruction-tuned LLMs to natural variations in the instructions provided for clinical NLP tasks? We collect prompts from medical doctors across a range of tasks and quantify the sensitivity of seven LLMs -- some general, others specialized -- to natural (i.e., non-adversarial) instruction phrasings. We find that performance varies substantially across all models, and that -- perhaps surprisingly -- domain-specific models explicitly trained on clinical data are especially brittle compared to their general-domain counterparts. Further, arbitrary phrasing differences can affect fairness, e.g., valid but distinct instructions for mortality prediction yield a wide range both in overall performance and in the size of differences between demographic groups.