Despite dramatic progress in Large Language Model (LLM) development, LLMs often produce plausible-sounding but non-factual information, commonly referred to as hallucinations. Retrieval-augmented LLMs offer a non-parametric approach to this problem by retrieving relevant information from external data sources and using it to augment the training process. These models help trace evidence to an externally provided knowledge base, allowing model predictions to be better interpreted and verified. In this work, we critically evaluate the ability of these models to perform scientific document reasoning tasks. To this end, we tuned multiple such model variants with science-focused instructions and evaluated them on a scientific document reasoning benchmark for the usefulness of the retrieved document passages. Our findings suggest that models justify predictions in science tasks with fabricated evidence, and that leveraging a scientific corpus as pretraining data does not alleviate the risk of evidence fabrication.