Large language models (LLMs) have shown considerable potential in supporting medical diagnosis. However, their effective integration into clinical workflows is hindered by physicians' difficulty in perceiving and trusting LLM capabilities, which often results in miscalibrated trust. Existing model evaluations emphasize standardized benchmarks and predefined tasks, offering limited insight into clinical reasoning practices. Moreover, research on human-AI collaboration has rarely examined physicians' perceptions of LLMs' clinical reasoning capability. In this work, we investigate how physicians perceive LLMs' capabilities in the clinical reasoning process. We designed clinical cases, collected the corresponding LLM analyses, and obtained evaluations from physicians (N=37) to quantitatively represent their perceived LLM diagnostic capabilities. By comparing these perceived evaluations with benchmark performance, our study highlights the aspects of clinical reasoning that physicians value and underscores the limitations of benchmark-based evaluation. We further discuss implications and opportunities for fostering trustworthy collaboration between physicians and LLMs in LLM-supported clinical reasoning.