Large language models (LLMs) have achieved widespread success on a variety of in-context few-shot tasks, but this success is typically evaluated via correctness rather than consistency. We argue that self-consistency is an important criterion for valid multi-step reasoning and propose two types of self-consistency that are particularly important for multi-step logic -- hypothetical consistency (the ability of a model to predict what its output would be in a hypothetical other context) and compositional consistency (consistency of a model's final outputs for a compositional task even when an intermediate step is replaced with the model's own output for that step). We demonstrate that four sizes of the GPT-3 model exhibit poor consistency rates on both types of consistency across four different tasks (Wikipedia, DailyDialog, arithmetic, and GeoQuery).
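To make the two definitions concrete, the following is a minimal sketch of how each check could be expressed, not the evaluation protocol used in the paper. The `Model` interface (a function from prompt string to completion string), the function names, the exact-match comparison, and the lookup-table `toy_model` are all assumptions introduced for illustration.

```python
from typing import Callable, Iterable

# Hypothetical interface: a model is any function from prompt to completion.
Model = Callable[[str], str]


def compositionally_consistent(model: Model, prompt: str, sub_prompt: str) -> bool:
    """Check whether replacing an intermediate step with the model's own
    answer for that step leaves the model's final answer unchanged."""
    if sub_prompt not in prompt:
        raise ValueError("sub_prompt must occur in prompt")
    sub_answer = model(sub_prompt).strip()
    # Substitute the model's own intermediate answer into the full task.
    substituted = prompt.replace(sub_prompt, sub_answer, 1)
    return model(prompt).strip() == model(substituted).strip()


def hypothetically_consistent(model: Model, prompt: str, meta_prompt: str) -> bool:
    """Check whether the model's prediction of its own output (elicited by
    `meta_prompt`) matches what it actually outputs for `prompt`."""
    return model(meta_prompt).strip() == model(prompt).strip()


def consistency_rate(results: Iterable[bool]) -> float:
    """Fraction of checks that passed (0.0 for an empty collection)."""
    results = list(results)
    return sum(results) / len(results) if results else 0.0


# Toy stand-in for an LLM (an assumption): a fixed lookup table in which the
# model answers "(2+3)*4" correctly but gets the sub-step "(2+3)" wrong.
_TOY_ANSWERS = {
    "(1+1)": "2", "(1+1)*3": "6", "2*3": "6",
    "(2+3)": "6", "(2+3)*4": "20", "6*4": "24",
    "What would you answer to (1+1)?": "2",
}


def toy_model(prompt: str) -> str:
    return _TOY_ANSWERS[prompt]


checks = [
    compositionally_consistent(toy_model, "(1+1)*3", "(1+1)"),
    compositionally_consistent(toy_model, "(2+3)*4", "(2+3)"),
]
rate = consistency_rate(checks)
```

In this toy setup the model's final answer to `(2+3)*4` is correct, yet the check fails because substituting its own intermediate answer changes the result. This is the sense in which consistency is a separate property from correctness.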