Recent work has shown how to prompt large language models with explanations to obtain strong performance on textual reasoning tasks, i.e., the chain-of-thought paradigm. However, subtly different explanations can yield widely varying downstream task accuracy. Explanations that have not been "tuned" for a task, such as off-the-shelf explanations written by nonexperts, may lead to mediocre performance. This paper tackles the problem of how to optimize explanation-infused prompts in a black-box fashion. We first generate sets of candidate explanations for each example in the prompt using a leave-one-out scheme, then find an effective combination of these explanations with a two-stage framework. In the first stage, we evaluate the explanations for each in-context example in isolation according to two proxy metrics: log likelihood and accuracy on new examples. In the second stage, we search over combinations of explanations to find one that yields high performance on a silver-labeled development set. Across four textual reasoning tasks spanning question answering, mathematical reasoning, and natural language inference, results show that our proxy metrics correlate with ground-truth accuracy and that our overall method effectively improves prompts over crowdworker annotations and naive search strategies.
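The two-stage framework described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `select_explanations`, the callables `proxy_score` and `dev_accuracy`, and the `budget` parameter are all hypothetical, and ranking combinations by the sum of their per-explanation proxy scores is an assumption made only for illustration. In practice `proxy_score` would wrap a proxy metric such as log likelihood, and `dev_accuracy` would query the language model on the silver-labeled development set.

```python
from itertools import product
from typing import Callable, Optional, Sequence, Tuple


def select_explanations(
    candidates: Sequence[Sequence[str]],
    proxy_score: Callable[[int, str], float],
    dev_accuracy: Callable[[Tuple[str, ...]], float],
    budget: int,
) -> Tuple[Optional[Tuple[str, ...]], float]:
    """Pick one explanation per in-context example (illustrative only).

    candidates[i] holds the candidate explanations for in-context example i.
    """
    # Stage 1: score every candidate explanation in isolation with a proxy metric.
    scores = [
        {expl: proxy_score(i, expl) for expl in cands}
        for i, cands in enumerate(candidates)
    ]

    # Rank full combinations by the sum of their proxy scores (an assumption).
    # Exhaustive enumeration is exponential, so this only suits toy inputs.
    combos = sorted(
        product(*candidates),
        key=lambda combo: sum(scores[i][e] for i, e in enumerate(combo)),
        reverse=True,
    )

    # Stage 2: spend the evaluation budget on the highest-ranked combinations
    # and keep the one with the best accuracy on the silver-labeled dev set.
    best_combo, best_acc = None, float("-inf")
    for combo in combos[:budget]:
        acc = dev_accuracy(combo)
        if acc > best_acc:
            best_combo, best_acc = combo, acc
    return best_combo, best_acc
```

The point of the design is that the expensive development-set evaluation runs at most `budget` times, while the cheap proxy scores decide which combinations are worth evaluating.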