Recent work shows that retrieval-augmented generation with large language models is sensitive to the order of retrieved documents in the context. However, the lack of in-depth analysis has limited the practical use of this phenomenon for prompt engineering. In this study, we posit that likelihood serves as an effective gauge of language model performance. Through experiments on two question-answering datasets with a variety of state-of-the-art language models, we reveal correlations between answer accuracy and question likelihood at both the corpus level and the instance level. In addition, we find that question likelihood can also indicate the position of task-relevant information in the context. Based on these findings, we propose two methods that use question likelihood as a gauge for selecting and constructing prompts, and we demonstrate experimentally that they lead to better performance. Moreover, our likelihood-based methods are efficient: they only need to compute the likelihood of the input, requiring far fewer language model passes than heuristic prompt engineering methods that must generate responses. Our analysis deepens the understanding of how input prompts affect model performance and points to a promising direction for efficient prompt optimization.
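The selection idea described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes each candidate prompt (e.g. a different ordering of the retrieved documents) has already been scored by a single language model forward pass, yielding per-token log-probabilities for the question tokens; the function names and the candidate dictionary are hypothetical.

```python
def question_loglikelihood(token_logprobs):
    """Total log-likelihood of the question: the sum of per-token
    log-probabilities assigned to the question tokens by the LM,
    conditioned on one candidate context (one document ordering)."""
    return sum(token_logprobs)

def select_prompt(candidates):
    """Given a mapping {candidate_name: question_token_logprobs},
    pick the candidate whose context gives the question the
    highest likelihood under the model."""
    return max(candidates, key=lambda name: question_loglikelihood(candidates[name]))

# Hypothetical scores for two orderings of the same retrieved documents.
candidates = {
    "relevant_doc_first": [-0.2, -0.1, -0.3],
    "relevant_doc_last": [-1.5, -0.9, -1.1],
}
best = select_prompt(candidates)
```

Because only input likelihoods are needed, each candidate costs one forward pass with no decoding, which is the efficiency advantage the abstract highlights over generate-then-evaluate prompt search.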