Appraising the Potential Uses and Harms of LLMs for Medical Systematic Reviews

Medical systematic reviews are crucial for informing clinical decision making and healthcare policy. But producing such reviews is onerous and time-consuming. Thus, high-quality evidence synopses are not available for many questions and may be outdated even when they are available. Large language models (LLMs) are now capable of generating long-form texts, suggesting the tantalizing possibility of automatically generating literature reviews on demand. However, LLMs sometimes generate inaccurate (and potentially misleading) texts by hallucinating or omitting important information. In the healthcare context, this may render LLMs unusable at best and dangerous at worst. Most discussion surrounding the benefits and risks of LLMs has been divorced from specific applications. In this work, we seek to qualitatively characterize the potential utility and risks of LLMs for assisting in the production of medical evidence reviews. We conducted 16 semi-structured interviews with international experts in systematic reviews, grounding discussion in the context of generating evidence reviews. Domain experts indicated that LLMs could aid in writing reviews, as a tool for drafting or creating plain language summaries, generating templates or suggestions, distilling information, crosschecking, and synthesizing or interpreting text inputs. But they also identified issues with model outputs and expressed concerns about the potential downstream harms of confidently composed but inaccurate LLM outputs that might mislead readers. Other anticipated downstream harms included lessened accountability and a proliferation of automatically generated reviews that might be of low quality. Informed by this qualitative analysis, we identify criteria for the rigorous evaluation of biomedical LLMs, aligned with domain expert views.
