Despite recent progress, it has been difficult to prevent semantic hallucinations in generative Large Language Models (LLMs). One common solution is to augment LLMs with a retrieval system and ensure that the generated output is attributable to the retrieved information. Given this added constraint, it is plausible to expect that the overall quality of the output, for example its fluency, will be affected. Can scaling language models help? Here we examine the relationship between fluency and attribution in LLMs prompted with retrieved evidence in knowledge-heavy dialog settings. Our experiments use a set of automatic metrics that are aligned with human preferences to evaluate a large set of generations produced under varying parameters of LLMs and supplied context. We show that larger models tend to do much better in both fluency and attribution, and that (naively) using top-k retrieval instead of top-1 retrieval improves attribution but hurts fluency. We then propose a recipe that could allow smaller models both to close the gap with larger models and to preserve the benefits of top-k retrieval while avoiding its drawbacks.