Large language models (LLMs) have revolutionized NLP by solving downstream tasks with little to no labeled data. Despite their versatile abilities, the larger question of their ability to reason remains poorly understood. This paper addresses reasoning in math word problems (MWPs) by studying symbolic versions of the numeric problems, since a symbolic expression is a "concise explanation" of the numeric answer. We create and use a symbolic version of the SVAMP dataset and find that GPT-3's davinci-002 model also has good zero-shot accuracy on symbolic MWPs. To evaluate the faithfulness of the model's reasoning, we go beyond accuracy and additionally evaluate the alignment between the final answer and the outputted reasoning, which correspond to the numeric and symbolic answers, respectively, for MWPs. We explore a self-prompting approach to encourage the symbolic reasoning to align with the numeric answer, thus equipping the LLM with the ability to provide concise and verifiable reasoning and making it more interpretable. Surprisingly, self-prompting also raises the symbolic accuracy above both the numeric and symbolic accuracies obtained without it, thus providing an ensembling effect. The SVAMP_Sym dataset will be released for future research on symbolic math problems.
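To make the notion of alignment concrete, the following is a minimal sketch of one way to check whether a symbolic answer agrees with a numeric answer: substitute the problem's numbers into the symbolic expression and compare the result with the numeric answer. The abstract does not specify the exact alignment metric, so the function names (`evaluate_symbolic`, `is_aligned`), the variable names (`w`, `x`, `y`), the tolerance, and the example values are all assumptions made for illustration, not the procedure used in this work.

```python
import ast
import operator

# Arithmetic operators permitted in a symbolic answer (an assumption of this sketch).
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def evaluate_symbolic(expression, values):
    """Evaluate an arithmetic expression over named variables without using eval()."""

    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -_eval(node.operand)
        if isinstance(node, ast.Name):
            return values[node.id]  # look up the number bound to this variable
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        raise ValueError("unsupported syntax in symbolic expression")

    return _eval(ast.parse(expression, mode="eval"))


def is_aligned(symbolic_expression, numeric_answer, values, tol=1e-6):
    """Return True if the symbolic expression reproduces the numeric answer."""
    return abs(evaluate_symbolic(symbolic_expression, values) - numeric_answer) <= tol


# Hypothetical example: the numeric problem uses w=10, x=3, y=5.
values = {"w": 10, "x": 3, "y": 5}
print(is_aligned("w - x + y", 12, values))  # expression matches the numeric answer
print(is_aligned("w + x", 12, values))      # expression does not match
```

Under this sketch, a model output counts as aligned when its symbolic expression, evaluated on the original numbers, reproduces its own numeric answer, regardless of whether that answer is correct.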