Large language models (LLMs) have achieved widespread success on a variety of in-context few-shot tasks, but this success is typically evaluated via correctness rather than consistency. We argue that self-consistency is an important criterion for valid multi-step reasoning in tasks where the solution is composed of the answers to multiple sub-steps. We propose two types of self-consistency that are particularly important for multi-step reasoning -- hypothetical consistency (a model's ability to predict what its output would be in a hypothetical other context) and compositional consistency (consistency of a model's final outputs when intermediate sub-steps are replaced with the model's outputs for those steps). We demonstrate that multiple variants of the GPT-3/-4 models exhibit poor rates of both types of consistency on a variety of tasks.
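To make the two definitions concrete, the following is a minimal sketch of how each consistency rate could be computed. It is not the evaluation protocol behind the reported results. Everything named here is an assumption made for illustration: the `Model` interface (a function from prompt string to output string), the exact-match comparison, the string-substitution scheme, the meta-prompt wording, and the `toy_model` lookup table that stands in for a real LLM.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical interface: a model maps a prompt string to an output string.
Model = Callable[[str], str]


def compositional_consistency_rate(
    model: Model, cases: Sequence[Tuple[str, str]]
) -> float:
    """Fraction of cases where the final output is unchanged after a sub-step
    is replaced by the model's own output for that sub-step.

    Each case is (full_prompt, sub_prompt), with sub_prompt occurring verbatim
    inside full_prompt (an assumption of this sketch).
    """
    consistent = 0
    for full_prompt, sub_prompt in cases:
        sub_answer = model(sub_prompt)
        substituted = full_prompt.replace(sub_prompt, sub_answer, 1)
        if model(full_prompt) == model(substituted):
            consistent += 1
    return consistent / len(cases)


def hypothetical_consistency_rate(
    model: Model, prompts: Sequence[str], make_meta_prompt: Callable[[str], str]
) -> float:
    """Fraction of prompts where the model's prediction of its own output
    matches what it actually outputs for that prompt."""
    consistent = 0
    for prompt in prompts:
        actual = model(prompt)
        predicted = model(make_meta_prompt(prompt))
        if actual == predicted:
            consistent += 1
    return consistent / len(prompts)


def meta_prompt(prompt: str) -> str:
    # Assumed wording; any phrasing that asks for a self-prediction would do.
    return "What would you output for: " + prompt


def toy_model(prompt: str) -> str:
    # Stand-in for an LLM: a fixed lookup table with deliberately
    # inconsistent entries for the "2 + 3" examples.
    table = {
        "2 + 3": "6",
        "(2 + 3) * 4": "20",
        "(6) * 4": "24",
        "1 + 1": "2",
        "(1 + 1) * 5": "10",
        "(2) * 5": "10",
        "What would you output for: 2 + 3": "5",
        "What would you output for: 1 + 1": "2",
    }
    return table.get(prompt, "unknown")


cases = [("(2 + 3) * 4", "2 + 3"), ("(1 + 1) * 5", "1 + 1")]
comp_rate = compositional_consistency_rate(toy_model, cases)
hyp_rate = hypothetical_consistency_rate(toy_model, ["2 + 3", "1 + 1"], meta_prompt)
print(comp_rate, hyp_rate)
```

In this toy setup the stand-in model is consistent on one of the two cases for each measure. Note that neither rate depends on whether the outputs are correct: the sketch compares the model only against itself, which is the distinction between consistency and correctness drawn above.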