Recent work has shown that the prompt-based learning capabilities of language models (LMs) make them well suited for automating data labeling in domains where manual annotation is expensive. The challenge is that while writing an initial prompt is cheap, improving a prompt is costly: practitioners often require significant labeled data in order to evaluate the impact of prompt modifications. Our work asks whether it is possible to improve prompt-based learning without additional labeled data. We approach this problem by modifying the predictions of a prompt, rather than the prompt itself. Our intuition is that accurate predictions should also be consistent: samples which are similar under some feature representation should receive the same prompt prediction. We propose Embroid, a method which computes multiple representations of a dataset under different embedding functions, and uses the consistency between the LM predictions for neighboring samples to identify mispredictions. Embroid then uses these neighborhoods to create additional predictions for each sample, and combines these predictions with a simple latent variable graphical model in order to generate a final corrected prediction. In addition to providing a theoretical analysis of Embroid, we conduct a rigorous empirical evaluation across six different LMs and up to 95 different tasks. We find that Embroid (1) substantially improves performance over original prompts (e.g., by an average of 7.3 points on GPT-JT), (2) also realizes improvements for more sophisticated prompting strategies (e.g., chain-of-thought), and (3) can be specialized to domains like law through the embedding functions.
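To make the consistency intuition concrete, the following is a minimal sketch of neighborhood-based prediction smoothing, not the method as evaluated in the paper. It assumes binary LM predictions encoded as +1/-1 and nonzero embedding vectors, and it replaces the latent variable graphical model with a plain majority vote over the original prediction and one neighborhood vote per embedding space. All function names (`knn_indices`, `neighborhood_votes`, `smooth_predictions`) and the toy data are hypothetical.

```python
import numpy as np

def knn_indices(emb, k):
    """Return, for each row, the indices of its k nearest neighbors
    under cosine similarity, excluding the row itself."""
    normed = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = normed @ normed.T
    np.fill_diagonal(sim, -np.inf)  # a sample is never its own neighbor
    return np.argsort(-sim, axis=1)[:, :k]

def neighborhood_votes(preds, emb, k):
    """Majority LM prediction (+1/-1) among each sample's k neighbors.
    Ties (possible only for even k) fall to +1 in this sketch."""
    nbrs = knn_indices(emb, k)
    mean_vote = preds[nbrs].mean(axis=1)
    return np.where(mean_vote >= 0, 1, -1)

def smooth_predictions(preds, embeddings, k=3):
    """Combine the original prediction with one neighborhood vote per
    embedding space. Assumption: simple majority voting stands in for
    the latent variable graphical model described in the abstract."""
    votes = [preds] + [neighborhood_votes(preds, e, k) for e in embeddings]
    total = np.sum(votes, axis=0)
    # On a tie, keep the original LM prediction.
    return np.where(total == 0, preds, np.sign(total))

# Toy data: two well-separated clusters of four samples, seen through
# two hypothetical embedding functions.
emb_a = np.array([[1, 0.0], [1, 0.1], [1, 0.2], [1, -0.1],
                  [0, 1.0], [0.1, 1], [0.2, 1], [-0.1, 1]])
emb_b = np.array([[2, 0.1], [2, 0.3], [2, -0.2], [2, 0.0],
                  [0.1, 2], [0.3, 2], [-0.2, 2], [0.0, 2]])
truth = np.array([1, 1, 1, 1, -1, -1, -1, -1])
preds = np.array([1, 1, -1, 1, -1, -1, -1, -1])  # sample 2 is mispredicted

corrected = smooth_predictions(preds, [emb_a, emb_b], k=3)
print(corrected.tolist())  # [1, 1, 1, 1, -1, -1, -1, -1]
```

In this toy case the single misprediction is outvoted because its neighbors in both embedding spaces agree on the opposite label. A majority vote weights every embedding space equally. The graphical model described in the abstract is what produces the final corrected prediction in Embroid itself.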