With the advancement of large language models (LLMs), the biomedical domain has seen significant progress in multiple tasks, such as biomedical question answering, lay-language summarization of the biomedical literature, and clinical note summarization. However, hallucinations, or confabulations, remain one of the key challenges when using LLMs in the biomedical and other domains. Inaccuracies may be particularly harmful in high-risk situations, such as medical question answering, making clinical decisions, or appraising biomedical research. Studies evaluating the ability of LLMs to ground generated statements in verifiable sources have shown that models perform significantly worse on lay-user-generated questions and often fail to reference relevant sources. This is problematic when information seekers want evidence from studies to back up the claims made by LLMs. Unsupported statements are a major barrier to using LLMs in any application that may affect health. Methods for grounding generated statements in reliable sources, along with practical evaluation approaches, are needed to overcome this barrier. Toward this goal, in our pilot task organized at TREC 2024, we introduced the task of reference attribution as a means of mitigating the generation of false statements by LLMs answering biomedical questions.