Prior work on prompt engineering for large language models has introduced various gradient-free, probability-based prompt selection methods that aim to choose the optimal prompt among candidates for a given task, but it has not provided a comprehensive and fair comparison among these methods. In this paper, we propose a unified framework to interpret and evaluate the existing probability-based prompt selection methods, performing extensive experiments on 13 common and diverse NLP tasks. We find that each existing method can be interpreted as a variant of the method that maximizes the mutual information between the input and the predicted output (MI). Building on this finding, we develop several other combinatorial variants of MI and increase the effectiveness of the oracle prompt selection method from 87.79% to 94.98%, where effectiveness is measured as the ratio of the performance of the selected prompt to that of the optimal oracle prompt. Furthermore, because all of these methods rely on the model's output probability distribution, which may be biased, we propose a novel calibration method called Calibration by Marginalization (CBM). CBM is orthogonal to the existing methods and increases the prompt selection effectiveness of the best method to 96.85%, achieving 99.44% of the oracle prompt F1 without calibration.
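To make the MI criterion concrete, the following is a minimal sketch of mutual-information-based prompt selection, not the implementation evaluated in this work. It assumes that for each candidate prompt we already have the model's output distribution over labels for a set of unlabeled inputs, and it uses the standard decomposition of MI as the entropy of the averaged distribution minus the average per-input entropy. The function names, the `candidates` dictionary, and the toy probabilities are all hypothetical.

```python
import math


def entropy(dist):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0)


def mutual_information(probs):
    """Estimate MI between input and predicted output for one prompt.

    probs: list of rows, one per input; each row is the model's
    probability distribution over labels for that input.
    """
    n = len(probs)
    k = len(probs[0])
    # Marginal output distribution: average of the per-input distributions.
    marginal = [sum(row[j] for row in probs) / n for j in range(k)]
    # Conditional entropy: average entropy of the per-input distributions.
    conditional = sum(entropy(row) for row in probs) / n
    return entropy(marginal) - conditional


def select_prompt(probs_by_prompt):
    """Return the prompt name with the highest MI score, plus all scores."""
    scores = {name: mutual_information(p) for name, p in probs_by_prompt.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical toy data: two prompts, two inputs, two labels.
candidates = {
    "prompt_a": [[0.9, 0.1], [0.1, 0.9]],  # confident and balanced
    "prompt_b": [[0.5, 0.5], [0.5, 0.5]],  # uninformative
}
best, scores = select_prompt(candidates)
```

Under this criterion a prompt scores highly when the model is confident on individual inputs while its predictions remain spread across labels overall, which is why `prompt_a` is preferred over the uninformative `prompt_b` in the toy data. No labels are needed, which is what makes the selection gradient-free and probability-based.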