Appraising the Potential Uses and Harms of LLMs for Medical Systematic Reviews

Medical systematic reviews are crucial for informing clinical decision making and healthcare policy. But producing such reviews is onerous and time-consuming. Thus, high-quality evidence synopses are not available for many questions and may be outdated even when they are available. Large language models (LLMs) are now capable of generating long-form texts, suggesting the tantalizing possibility of automatically generating literature reviews on demand. However, LLMs sometimes generate inaccurate (and potentially misleading) texts by hallucinating or omitting important information. In the healthcare context, this may render LLMs unusable at best and dangerous at worst. Most discussion surrounding the benefits and risks of LLMs has been divorced from specific applications. In this work, we seek to qualitatively characterize the potential utility and risks of LLMs for assisting in the production of medical evidence reviews. We conducted 16 semi-structured interviews with international experts in systematic reviews, grounding discussion in the context of generating evidence reviews. Domain experts indicated that LLMs could aid in writing reviews as a tool for drafting or creating plain language summaries, generating templates or suggestions, distilling information, crosschecking, and synthesizing or interpreting text inputs. But they also identified issues with model outputs and expressed concerns about potential downstream harms of confidently composed but inaccurate LLM outputs that might mislead readers. Other anticipated downstream harms included lessened accountability and a proliferation of automatically generated reviews that might be of low quality. Informed by this qualitative analysis, we identify criteria for rigorous evaluation of biomedical LLMs aligned with domain expert views.
